Friday 10 May 2024

Celebrity problem in Data Sharding

 

Data sharding is a technique of dividing the large datasets into smaller, more manageable segments known as shards, which are subsequently spread across numerous servers or machines for efficient management.

Imagine you have a large database containing user information, such as user profiles, preferences, and transactions. To efficiently manage this data across your three servers, you decided to shard it into three separate parts or shards.

 

Shard Calculation

You employ a simple sharding algorithm where you determine the shard for each user based on their user ID. Using the formula userId % noOfServers = userId % 3, you calculate the shard number by taking the remainder of the user ID divided by the total number of servers (3 in this case).

 

Data Distribution

a.   Server 0: All user data where (userId % 3 == 0) will be stored on Server 0.

b.   Server 1: All user data where (userId % 3 == 1) will be stored on Server 1.

c.    Server 2: All user data where (userId % 3 == 2) will be stored on Server 2.

 

For example:

a.   User with userId 6: 6 % 3 = 0 (stored on Server 0)

b.   User with userId 7: 7 % 3 = 1 (stored on Server 1)

c.    User with userId 8: 8 % 3 = 2 (stored on Server 2)

d.   User with userId 9: 9 % 3 = 0 (stored on Server 0) and so on…

 

Query Routing

When a query or request is made to retrieve or modify user data, the system first determines which shard the data belongs to, based on the user's ID using the same sharding algorithm. Then, it routes the query to the appropriate server where that shard is located.

 


Scaling

As your database grows or your workload increases, you can add more servers and redistribute the shards accordingly to maintain performance and accommodate the additional data.

 

What is Celebrity problem in Data Sharding

When famous people, like celebrities, get a lot of attention, it creates a hotspot. This means that a ton of people want to read or write things about them. So, servers that store data about these famous folks can get really busy. For example, let's say 'Brad Pitt' and 'Robert Downey Jr.' have user ids 4 and 7. According to the sharding rule, their data goes to Server 1. Since they're so famous, lots of people want to see their profiles and latest updates. This puts a big load on Server 1 compared to others.

 

How to address above problem?

Here's a rephrased version of the points you provided:

 

Shard redefinition:  If some celebrities are constantly hogging the server's attention (high traffic), consider reshaping how you store their data. Spreading it across multiple servers (shards) will prevent any single server from getting overloaded.

 

Cache the data:  To improve response times, store frequently accessed celebrity information closer to users using a cache. This way, the system doesn't have to constantly retrieve data from the main database, reducing load and speeding things up.

 

Vertical Scaling:  If specific servers are struggling under the celebrity crush, consider giving them a hardware upgrade. This could involve adding more processing power (CPUs), expanding memory, or improving network capacity for smoother data flow.

 

Shard duplication and distribute the load: Instead of storing all data related to a celebrity on a single shard, duplicate their data across multiple shards. This means that copies of the celebrity's data will be stored on different servers, distributing the load associated with their popularity more evenly. For example, you could replicate data for highly popular celebrities like 'Brad Pitt' and 'Robert Downey Jr.' across multiple shards or servers to ensure that no single server becomes overwhelmed with traffic.

 

To mitigate the issue of high traffic on Shard 1 due to celebrity data, we can implement a solution by duplicating Shard 1. This ensures that the data from Shard 1, which experiences the most traffic, is available on more than one server. Consequently, the workload is evenly distributed among all servers, promoting better load balancing and improving system performance.

 


Since ‘Shard 1’ experiencing more traffic, I duplicated this across three servers, where as other two shards data is experience less traffic, 2 copies might be sufficient for them.

 

                                                  System Design Questions

No comments:

Post a Comment