This post explains how modern databases handle large amounts of data while staying fast and reliable using methods like sharding, replication, and denormalization. It looks at how databases have evolved from traditional relational systems to modern NoSQL ones like MongoDB and Cassandra, describing how they work and the pros and cons of each.
By the end, you’ll understand the main ideas behind distributed data design and how to make smart choices that balance consistency, availability, and scalability.
1. The Need for Scalable Databases
Modern applications, from social media platforms to e-commerce giants and IoT networks, deal with huge amounts of data and user activity. Every click, view, purchase, sensor signal, and recommendation translates into a database operation. As businesses grow, so does the need for databases that can scale seamlessly, handling billions of records, millions of users, and continuous uptime expectations.
This section explains why scalability has become a first-class requirement for modern databases, where traditional relational models fall short, and how horizontal scaling and distributed systems changed the game.
1.1 The Rise of Data Volume and Concurrent Users
In the early 2000s, a typical enterprise database might have handled a few gigabytes of structured data and a few hundred concurrent users. Today, systems like Instagram, Netflix, and Amazon handle petabytes of data and millions of simultaneous users across the globe, all expecting instant responses.
Example 1: Social Media Growth
· When Facebook launched in 2004, it served data from a single MySQL database.
· Within a few years, user numbers exploded from thousands to millions.
· Every user profile, post, comment, and reaction became a row in a database table.
· The relational database couldn’t handle that level of read/write concurrency and data volume.
· This led Facebook to adopt MySQL sharding and later introduce TAO, their own distributed data store for social graphs.
Example 2: E-commerce Traffic Spikes
· Consider an e-commerce site like Flipkart or Amazon during a "Big Billion Sale".
· The number of concurrent users might jump from 10,000 to 5 million within minutes.
· A single database server, even a high-end one, cannot handle such a surge in queries.
· Without a scalable database, users would see "Site Down" errors or slow checkout experiences.
The solution is to distribute both the data and the load across multiple nodes, allowing the system to expand dynamically. These examples underline one truth: data growth is not linear but exponential, and databases must be designed to scale in step.
1.2 Why Traditional Relational Databases Hit Limits
Traditional relational databases like MySQL, PostgreSQL, and Oracle were designed in an era when data fit comfortably on a single machine. They emphasize ACID guarantees (Atomicity, Consistency, Isolation, Durability), which ensure data correctness but limit scalability.
Let’s understand their key bottlenecks:
1.2.1 Single-Server Dependency
A typical RDBMS installation runs on a single server. You can make it faster by adding more CPU, memory, or faster disks, but only up to a point. Eventually, you hit hardware and I/O limits.
1.2.2 Locking and Transactions
When multiple users try to modify the same table or row, locks are used to maintain consistency. Under heavy load, this causes contention: many transactions waiting for one to complete. For example, imagine 100,000 users trying to update their shopping carts at the same time. The system must ensure each cart update is isolated and correct, but the locking overhead slows down the entire database.
1.2.3 Vertical Scaling Saturation
You can upgrade to a more powerful server, but after a point, adding more hardware brings diminishing returns. The CPU may be idle while the disk is busy, or the memory may be underutilized while I/O waits on locks.
Moreover, vertical scaling (buying bigger servers) is expensive and not fault-tolerant: one machine failure can take the entire database down.
1.2.4 Schema Rigidity
Relational databases enforce strict schemas. If your data model changes frequently (like in modern apps where new features evolve rapidly), schema migrations can be slow and risky, blocking production systems.
1.3 Horizontal vs Vertical Scaling
The two fundamental approaches to scaling databases are vertical scaling (scale up) and horizontal scaling (scale out).
Vertical Scaling
· Add more CPU, RAM, or faster storage to a single machine
· Example: Upgrading from a 16-core to a 64-core CPU, or from 64GB to 512GB of RAM
· Pros: Simple to implement, minimal code changes (sometimes none at all)
· Cons: Hardware limits, single point of failure, expensive
Horizontal Scaling
· Add more machines (nodes) and distribute data & requests across them
· Example: Adding 10 commodity servers to share the load
· Pros: Scalable, fault-tolerant, cost-effective
· Cons: Complex coordination, consistency management
For example, suppose your MySQL server on one machine is overloaded.
· Vertical scaling: upgrade the server to 32 cores and 512GB RAM. This works temporarily, but costs rise steeply.
· Horizontal scaling: split data across 10 smaller servers (shards), each handling a subset of the data. Now you can add an 11th or 12th server as demand grows; this is true scalability.
Modern distributed databases like Cassandra, MongoDB, and CockroachDB are built natively for horizontal scaling, enabling near-infinite expansion.
1.4 Characteristics of Modern Distributed Systems
Modern databases are designed as distributed systems, a collection of independent machines working together to appear as one logical database. Let’s explore their key characteristics.
1.4.1 Fault Tolerance
Failures are inevitable: disks crash, nodes go offline, networks partition. A distributed database ensures there is no single point of failure. Data is replicated across nodes, so if one node goes down, another takes over seamlessly.
Example:
In MongoDB, every shard is a replica set with one primary and multiple secondaries. If the primary fails, the secondaries automatically elect a new primary, all without manual intervention.
1.4.2 Elasticity
Modern systems can scale elastically, meaning they can add or remove resources dynamically based on load.
Cloud-native databases like Amazon DynamoDB or Google Spanner automatically scale up during traffic spikes and scale down during idle hours.
Example:
During peak shopping hours, DynamoDB might expand from 20 to 100 partitions. Once traffic drops, it automatically scales back to 20, saving cost while maintaining performance.
1.4.3 Availability
Distributed databases are built for high availability, ensuring that users can always read and write data even during failures. This is achieved by replication, automatic failover, and geo-distribution.
Example:
Cassandra replicates data across multiple nodes and even multiple data centers. If a node in Mumbai fails, requests can still be served from replicas in Delhi or Singapore, maintaining 24/7 uptime.
1.4.4 Eventual Consistency
To maintain availability during network failures, distributed databases often relax immediate consistency, meaning data updates may take a few seconds to propagate globally. This trade-off, known as eventual consistency, is central to systems like Cassandra and DynamoDB.
Example:
You might update your profile picture while a friend in another region sees the old one for a few seconds; that's eventual consistency in action.
In summary, the demand for real-time, global, large-scale applications has redefined what databases must deliver. Traditional relational databases prioritized data integrity; modern distributed systems prioritize availability, elasticity, and fault tolerance without compromising too much on consistency.
2. The Router–Shard–Replica Model
As applications grow and user requests flood in, a single database server, no matter how powerful, eventually becomes a bottleneck. The Router–Shard–Replica model is the architectural response: it distributes both data and workload intelligently across many machines. Let's understand how this works, step by step.
2.1 Overview of How Client Requests Are Routed
Imagine you run a global e-commerce platform where millions of customers browse, add to cart, and purchase simultaneously. When a client (say, the mobile app) sends a request to fetch a user’s order history, the system must know which server actually holds that user’s data. This is where the router or coordinator node steps in.
Instead of making every client aware of where each piece of data lives, the router acts as a traffic director:
· The client connects to the router.
· The router determines which shard (a shard holds a subset of the data) holds the relevant data, based on a shard key or hash function.
· It forwards the query to the correct shard and merges the results if needed.
The client sees only one endpoint, but behind the scenes, the router orchestrates the query distribution.
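To make this concrete, here is a toy Java sketch of a router mapping a shard key to a shard with a simple hash. The host names and the modulo scheme are illustrative assumptions; a real router like mongos instead consults chunk metadata from the config servers:

import java.util.List;

// Illustrative only: a toy router that hashes a shard key to pick a shard.
public class ShardRouter {

    private final List<String> shardHosts;

    public ShardRouter(List<String> shardHosts) {
        this.shardHosts = shardHosts;
    }

    // Math.floorMod keeps the index non-negative even if hashCode() is negative.
    public String shardFor(String shardKey) {
        int index = Math.floorMod(shardKey.hashCode(), shardHosts.size());
        return shardHosts.get(index);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(
                List.of("shard-1:27018", "shard-2:27018", "shard-3:27018"));
        // The client talks only to the router; the router picks the shard.
        System.out.println("user-10234 -> " + router.shardFor("user-10234"));
    }
}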
Example:
In MongoDB, this router is known as mongos, while in Cassandra, any node can serve as the coordinator. Both serve the same purpose: simplifying the client's view of the cluster.
2.2 The Concept of a Router (or Coordinator) Node
Think of the router as the "switchboard operator" of your database cluster (a switchboard operator routes incoming telephone calls to the correct person or department). It doesn't store actual data; instead, it maintains metadata about where each chunk of data resides.
In MongoDB:
· The router consults config servers, which store mapping information about data chunks and shard locations.
· When a query arrives, the router checks this metadata to decide which shard(s) to contact.
· If data moves (due to rebalancing or new shard addition), the config servers update this metadata, keeping the routing dynamic and consistent.
In Cassandra:
· There’s no central router. Any node can act as a coordinator by knowing the token ring layout. Read this post for more details: https://self-learning-java-tutorial.blogspot.com/2024/05/consistent-hashing-scalable-approach-to.html
· The coordinator identifies which nodes are responsible for a given partition key and forwards the request accordingly.
2.3 What a Shard Is: Horizontal Partitioning Explained
A shard is simply a subset of your total data, stored on a separate database server. Instead of one monolithic dataset living on one server, you divide it horizontally across multiple servers, each responsible for a specific range or hash of the data.
This is called horizontal partitioning, or sharding. Suppose an e-commerce company has 100 million customers.
· Shard 1 stores customers whose IDs start from 1–25 million
· Shard 2 stores IDs from 25–50 million
· Shard 3 stores IDs from 50–75 million
· Shard 4 stores IDs from 75–100 million
Each shard handles queries for its own range, so the load is spread evenly. Horizontal partitioning splits rows (rather than columns), which is more scalable for high-volume systems since it keeps the schema consistent across shards and allows parallelism.
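As a rough sketch, this range-to-shard lookup can be modeled in Java with a sorted map: floorEntry finds the range whose lower bound is at or below a given customer ID. The shard names and boundaries are the hypothetical ones from the example above:

import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch: range-based (horizontal) partitioning of customer IDs.
public class RangeShardLookup {

    // Maps the lower bound of each ID range to a shard name.
    private final NavigableMap<Long, String> ranges = new TreeMap<>();

    public RangeShardLookup() {
        ranges.put(1L, "shard-1");            // 1 - 25M
        ranges.put(25_000_001L, "shard-2");   // 25M - 50M
        ranges.put(50_000_001L, "shard-3");   // 50M - 75M
        ranges.put(75_000_001L, "shard-4");   // 75M - 100M
    }

    // floorEntry finds the range whose lower bound is <= the customer ID.
    // (IDs below 1 are out of range and not handled in this sketch.)
    public String shardFor(long customerId) {
        return ranges.floorEntry(customerId).getValue();
    }

    public static void main(String[] args) {
        RangeShardLookup lookup = new RangeShardLookup();
        System.out.println(lookup.shardFor(60_000_000L)); // prints shard-3
    }
}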
2.3.1 Adding Shards for Capacity and Redundancy
One of the most powerful aspects of sharding is scaling by addition: you don't upgrade hardware, you just add another machine.
When an e-commerce company grows to 200 million users, you can:
· Add new shards for new user ranges.
· Redistribute existing data across shards (a process called resharding).
· Automatically balance load so no single shard becomes a hotspot.
2.3.2 Shard Backups and Replication Strategies
Each shard must also be protected against hardware failures, so shards are replicated. Replication means maintaining multiple copies of the same data across different servers (often in different zones or regions).
Usually, one node acts as the primary (handling writes), while others act as secondaries (handling reads or staying in sync).
Example (MongoDB):
· Each shard is a replica set, where the primary node handles all write operations and secondary nodes replicate the data asynchronously. If the primary fails, one of the secondaries automatically becomes the new primary through an election process.
This replication provides:
· High availability: Failover without downtime.
· Disaster recovery: Even if one data center fails, another can continue serving.
· Read scalability: Secondary replicas can handle read-heavy workloads.
2.3.3 Benefits: Scalability and Resilience
The router–shard–replica model brings two major benefits:
Scalability:
· By adding shards, you add both storage and processing capacity.
· More shards = more parallelism = higher throughput.
· Horizontal scaling is near-linear when designed properly.
Resilience:
· Replicas ensure data durability and fault tolerance.
· Even during node or rack failures, data remains accessible.
· The system can self-heal via automatic failover and rebalancing.
Example:
In a replica set of 3 nodes, if the primary node crashes, a replica in another data center automatically takes over, and the router updates its routing metadata, all in seconds, without user-visible downtime.
2.3.4 Challenge: Cross-Shard Joins and Data Aggregation
Distributed systems aren't magic; they trade simplicity for scale. One of the biggest trade-offs is cross-shard queries.
When data relevant to a query lives across multiple shards:
· The router must scatter the query to all shards.
· Each shard processes its local part and returns results.
· The router merges the results before sending them back to the client.
This is known as a scatter-gather operation, and it can be expensive. For example, if a report needs the "total purchases across all users", every shard must contribute its partial total, introducing latency and coordination overhead.
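Here is a simplified Java sketch of scatter-gather: the router fans the query out to every shard in parallel and sums the partial totals. The per-shard call is stubbed out; in a real system it would be a network query:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative scatter-gather: query all shards in parallel, merge results.
public class ScatterGather {

    // Stand-in for a per-shard query such as "total purchases on this shard".
    static long partialTotalFromShard(String shard) {
        return shard.hashCode() & 0xFFFF; // fake partial result for the sketch
    }

    public static void main(String[] args) throws Exception {
        List<String> shards = List.of("shard-1", "shard-2", "shard-3", "shard-4");
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());

        // Scatter: send the sub-query to every shard in parallel.
        List<Future<Long>> futures = new ArrayList<>();
        for (String shard : shards) {
            futures.add(pool.submit(() -> partialTotalFromShard(shard)));
        }

        // Gather: merge partial results; latency is bounded by the slowest shard.
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        System.out.println("Total purchases across all shards: " + total);
        pool.shutdown();
    }
}

Because the merge must wait for every shard, one slow shard delays the whole query, which is exactly why the mitigation strategies below matter.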
Mitigation Strategies
· Carefully design shard keys that localize related data together.
· Pre-aggregate or cache computed results.
· Use analytical systems like Spark or BigQuery for heavy aggregation workloads.
The Router–Shard–Replica model unlocks true horizontal scalability — letting databases grow seamlessly as your data and users grow. However, scaling horizontally isn’t free: it introduces data distribution, query routing, and consistency challenges. Smart data modelling, thoughtful shard key selection, and robust replication strategies are what make this architecture both powerful and reliable in real-world systems.
3. MongoDB: A Case Study in Sharded NoSQL Design
MongoDB is one of the most successful examples of a distributed database that prioritizes simplicity, scalability, and developer productivity. Its design elegantly demonstrates how the Router–Shard–Replica model works in practice, balancing high performance with fault tolerance.
Let’s explore MongoDB’s architecture piece by piece.
Introduction to MongoDB’s Distributed Architecture
At its core, MongoDB is built for horizontal scalability, meaning it can distribute data across multiple machines (shards) while keeping the application view consistent.
A typical sharded MongoDB cluster has three major components:
· mongos: the query router
· Shard servers: the actual data-bearing nodes
· Config servers: the metadata managers
Each shard itself is not just one server but a replica set, ensuring data redundancy and high availability. Let’s see how these parts work together to form a cohesive, resilient system.
3.1 mongos: The Query Router
mongos is the entry point for all client applications. Think of it as MongoDB's traffic controller: it knows where data resides and routes each request to the appropriate shard(s).
Here’s how it works:
· A client connects to a mongos instance instead of directly connecting to a shard.
· The mongos consults the config servers to learn where the required data chunk resides.
· It forwards the query to the correct shard or, if necessary, to multiple shards (scatter–gather).
· It aggregates the results and sends them back to the client.
This means clients don't need to worry about where their data lives; they just query MongoDB as if it were a single large database.
Example:
If an e-commerce application stores user data sharded by userId, when a request for userId = 10234 arrives:
· mongos checks metadata in the config servers.
· It finds that userId 10234 lives on Shard 2.
· It sends the query to that shard and returns the results (see the sketch below).
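From the application's point of view, querying through mongos is no different from querying a single server. Here is a minimal sketch using the MongoDB Java driver (mongodb-driver-sync); the mongos-host address and the shop.users namespace are placeholder assumptions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

// The client connects to mongos exactly as if it were a single server;
// routing to the right shard is invisible to application code.
public class MongosQueryExample {
    public static void main(String[] args) {
        // "mongos-host" is a placeholder for your actual mongos address.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoCollection<Document> users = client
                    .getDatabase("shop")
                    .getCollection("users");

            // mongos resolves which shard owns userId 10234 and forwards the query.
            Document user = users.find(eq("userId", 10234)).first();
            System.out.println(user);
        }
    }
}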
3.2 Replica Sets: How MongoDB Ensures High Availability
Each shard in MongoDB isn't a single point of failure; it's a replica set, a self-healing cluster of nodes that maintain copies of the same data.
A replica set typically consists of:
· One primary node: Handles all write operations.
· Multiple secondary nodes: Replicate the data from the primary asynchronously.
This structure gives MongoDB both high availability and read scalability. If the primary node fails, MongoDB's built-in election mechanism automatically promotes a secondary to become the new primary, usually within seconds.
3.3 Primary vs Secondary Nodes
Let’s look at how reads and writes behave across these nodes.
· Writes: Always go to the primary node (to maintain a consistent order of operations).
· Reads: By default, also go to the primary for the most up-to-date data. However, applications can choose to read from secondaries using read preferences for better performance and geographic locality.
Example:
If an application has a shard located in Singapore with:
· Primary: Singapore
· Secondaries: Mumbai and London
Then:
· Users in India might read from Mumbai for lower latency.
· Writes still go to the Singapore primary.
· Data replication happens asynchronously, ensuring eventual consistency. (The snippet below shows how an application expresses this read preference.)
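In application code, this behavior is expressed as a read preference on the collection. A minimal sketch with the MongoDB Java driver; the host and namespace are placeholders:

import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Reads go to a nearby secondary when possible; writes still hit the primary.
public class ReadPreferenceExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoCollection<Document> users = client
                    .getDatabase("shop")
                    .getCollection("users")
                    // Prefer secondaries (e.g., the Mumbai replica) for lower read
                    // latency; results may be slightly stale because replication
                    // is asynchronous.
                    .withReadPreference(ReadPreference.secondaryPreferred());

            System.out.println(users.find().first());
        }
    }
}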
3.4 Automatic Election Mechanism
MongoDB’s self-healing nature is one of its strongest features. If the primary node of a replica set goes down:
· Secondaries detect the loss of heartbeat from the primary.
· They initiate an election process to choose a new primary.
· The node with the most recent operations (the most up-to-date replication log, called the oplog) is elected.
Once elected, the new primary starts accepting writes. This entire process happens automatically, without manual intervention, and ensures minimal downtime.
Example:
If the Singapore primary crashes, the Mumbai secondary with the latest data becomes the new primary within ~10 seconds, and all traffic is rerouted accordingly.
3.5 Config Servers: Metadata and Routing Information
Config servers are the brains of the sharded cluster: they store the metadata that maps data chunks to shards.
They maintain:
· Which data ranges belong to which shards
· Cluster topology and sharding configuration
· Balancer state for redistributing chunks
In a production setup, you’ll always find three config servers to ensure redundancy and quorum-based consistency. If a new shard is added or data is rebalanced, the config servers update this metadata, and mongos routers start using the new configuration almost instantly.
3.6 Shard Key Selection and Its Impact on Performance
The shard key is the single most important design decision in a MongoDB sharded cluster. It determines how data is distributed across shards and thus, how evenly your workload is balanced.
A shard key can be:
· Hashed (e.g., hash of userId): distributes data evenly but can make range queries slower.
· Ranged (e.g., date or numeric range): supports efficient range queries but risks hotspotting if values are sequential.
Example:
In an e-commerce application, if you shard by region and 80% of your traffic comes from Asia, your Asia shard will be overloaded, creating a hotspot. A better approach might be to shard by hash(userId) or by a composite key of (region, userId).
Always choose a shard key that:
· Distributes data evenly
· Matches your most common query patterns
Bad shard key selection is among the most common causes of performance issues in sharded MongoDB deployments.
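As a sketch of how such a key is applied, the shardCollection admin command shards a collection on a hashed key. The shop.users namespace is a placeholder, and the database is assumed to already have sharding enabled:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

// Shards shop.users on a hashed userId so documents spread evenly
// across shards instead of hotspotting on sequential values.
public class ShardKeyExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            Document result = client.getDatabase("admin").runCommand(
                    new Document("shardCollection", "shop.users")
                            .append("key", new Document("userId", "hashed")));
            System.out.println(result.toJson());
        }
    }
}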
In summary, MongoDB is designed for a world where networks can fail. It prioritizes Availability (A) and Partition Tolerance (P), accepting eventually consistent reads from secondary nodes to achieve this. This isn't a weakness; it's a deliberate, powerful choice.
However, MongoDB isn't designed for scenarios requiring strict, real-time cross-document transactions; for those, relational systems are better suited.
4. Cassandra: A Ring-Based Approach to Distribution
While MongoDB uses a router–shard–replica hierarchy, Cassandra takes a very different route. It was designed from day one to handle massive, always-on, global-scale workloads with no single point of failure.
Let’s explore how it achieves that through its peer-to-peer ring architecture.
The Peer-to-Peer Ring Architecture
Cassandra's core idea is simple: every node in the cluster is equal; there is no master or dedicated router. All nodes form a ring, each responsible for a specific range of data determined by a partition key.
How it works:
· Data is distributed across nodes using consistent hashing.
· Each node is assigned a range of tokens (hash values).
· When a record is written, its partition key (say, user_id) is hashed to find which node owns it.
· That node (and its replicas) handle the read and write requests for that data.
· Because every node knows the entire ring layout, any node can act as a coordinator, meaning the client can connect to any node, and it’ll route the request correctly.
Example:
Let’s say we have four nodes with these token ranges:
Node A: 0 – 25
Node B: 26 – 50
Node C: 51 – 75
Node D: 76 – 100
If a user with user_id = 43 writes data, the key hashes to 43 → handled by Node B. Node B stores the data and replicates it to other nodes (as per the replication strategy). Unlike MongoDB's mongos, Cassandra doesn't need a central router; the ring itself is the routing layer.
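The token-range lookup described above can be sketched in Java with a sorted map. This toy ring reuses the 0–100 token ranges from the example; real Cassandra uses Murmur3 hashes over a far larger token space (and virtual nodes), so treat this purely as an illustration:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring matching the 0-100 token example above.
public class TokenRing {

    // Maps the end token of each range to its owning node.
    private final NavigableMap<Integer, String> ring = new TreeMap<>();

    public void addNode(int endToken, String node) {
        ring.put(endToken, node);
    }

    // First node whose token is >= the key's hash; wrap to the first node
    // if the hash is past the last token (this is the "ring" behavior).
    public String nodeFor(int hashedKey) {
        Map.Entry<Integer, String> owner = ring.ceilingEntry(hashedKey);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(25, "Node A");   // tokens 0-25
        ring.addNode(50, "Node B");   // tokens 26-50
        ring.addNode(75, "Node C");   // tokens 51-75
        ring.addNode(100, "Node D");  // tokens 76-100
        // Pretend user_id = 43 hashes to token 43 -> owned by Node B.
        System.out.println("token 43 -> " + ring.nodeFor(43));
    }
}

Note how adding a new node only takes over part of one neighbor's range; this is why consistent hashing minimizes data movement when the cluster grows or shrinks.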
4.1 How Cassandra Eliminates Single Points of Failure
Cassandra’s masterless design is the cornerstone of its fault tolerance.
· Every node can serve reads and writes.
· No node is special, removing any node doesn’t stop the cluster.
· Metadata (cluster topology, ring ownership) is gossiped among nodes, so all stay synchronized.
If one node fails:
· Its data is still available from other replicas.
· The system continues to operate without downtime.
· When the failed node recovers, it performs hinted handoff and read repair to catch up with missed writes.
Example:
· If Node B (responsible for user_id 43) goes down, Node C and Node D (replicas) still have copies of that user’s data.
· Clients can still read and write; Cassandra automatically reroutes requests.
· When Node B comes back, it syncs missed updates from peers.
This design gives Cassandra continuous availability, even during network partitions or regional outages.
4.2 Replication Strategies Across Nodes
Cassandra doesn't just store one copy of data; it replicates data intelligently for fault tolerance and performance. Replication is controlled by a replication factor (RF) and a strategy defined per keyspace (a keyspace is like a database).
Common Replication Strategies
· SimpleStrategy (for single data centers): Data and its replicas are placed on the next N nodes clockwise in the ring.
· NetworkTopologyStrategy (for multiple data centers): You can specify how many replicas to keep per data center, enabling geo-redundancy.
Example
For a keyspace user_data:
CREATE KEYSPACE user_data
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,
  'eu-west': 2
};
This keeps:
· 3 replicas in the US-East region
· 2 replicas in EU-West
So even if one data center fails, another can continue serving.
4.3 How Cassandra Handles Writes and Achieves Eventual Consistency
Cassandra's write path is one of its most efficient features. Writes are fast, sequential, and always accepted, even during partial failures.
Here's what happens, step by step, when a client writes data:
· The client sends the write to any node (the coordinator).
· The coordinator hashes the partition key to find the responsible replica node(s).
· The coordinator forwards the write to all replica nodes.
· Each replica logs the write to its commit log (for durability) and writes it to an in-memory structure called a memtable.
· The coordinator waits for acknowledgments from the replicas (how many depends on the consistency level).
· Later, each replica flushes its memtable to disk as an SSTable (sorted string table).
This design makes writes append-only and non-blocking, which means Cassandra can sustain millions of writes per second.
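Here is a deliberately simplified Java sketch of a single replica's write path, showing the two steps that make writes fast: an append-only commit log for durability, plus an in-memory sorted memtable. Flushing the memtable to an SSTable is omitted:

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of one replica's write path: 1) append to a commit log,
// 2) update the in-memory memtable. SSTable flushing is out of scope.
public class ReplicaWritePath {

    private final FileWriter commitLog;
    // Memtable: sorted in-memory structure, later flushed as a sorted SSTable.
    private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();

    public ReplicaWritePath(String commitLogPath) throws IOException {
        this.commitLog = new FileWriter(commitLogPath, true); // append-only
    }

    public void write(String key, String value) throws IOException {
        // Step 1: durable, sequential append (fast: no random I/O).
        commitLog.write(key + "=" + value + System.lineSeparator());
        commitLog.flush();
        // Step 2: in-memory write; subsequent reads see it immediately.
        memtable.put(key, value);
    }

    public static void main(String[] args) throws IOException {
        ReplicaWritePath replica = new ReplicaWritePath("commitlog.txt");
        replica.write("user:43", "{\"name\": \"Alice\"}");
        System.out.println("memtable now holds: " + replica.memtable);
    }
}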
4.4 Consistency Levels and Eventual Consistency
Cassandra is an AP (Available + Partition Tolerant) system in CAP terms. It sacrifices strict consistency in favor of availability, but allows you to tune consistency per query.
You can specify consistency levels like:
· ONE: Wait for 1 replica acknowledgment (fastest, but weakest consistency)
· QUORUM: Wait for majority replicas (balanced)
· ALL: Wait for all replicas (strongest consistency, slower)
Example
If the replication factor = 3:
· A write at QUORUM waits for 2 replicas to confirm.
· A read at QUORUM is guaranteed to overlap with the write quorum, yielding consistent results.
This tunable model lets developers choose between speed and consistency per operation, as the driver snippet below shows.
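For instance, with the DataStax Java driver (4.x) the consistency level can be set per statement. This is a minimal sketch; the contact point, datacenter name, and table are placeholder assumptions:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

import java.net.InetSocketAddress;

// Per-query tunable consistency with the DataStax Java driver (4.x).
public class TunableConsistencyExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("cassandra-host", 9042))
                .withLocalDatacenter("us-east")
                .build()) {

            // With RF = 3, QUORUM waits for 2 of 3 replicas to acknowledge.
            SimpleStatement read = SimpleStatement
                    .newInstance("SELECT * FROM user_data.users WHERE user_id = 43")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM);

            System.out.println(session.execute(read).one());
        }
    }
}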
How Cassandra achieves eventual consistency:
· Hinted handoff: Temporarily stores writes for unreachable nodes and replays later.
· Read repair: During reads, if replicas have inconsistent data, Cassandra fixes them in the background.
· Anti-entropy repair: Periodically syncs SSTables between replicas.
Together, these mechanisms guarantee that even if replicas temporarily diverge, they’ll eventually converge to a consistent state.
5. Real-World Challenges in Scalable Databases
Scaling databases horizontally sounds elegant in theory: add more nodes, distribute data, and handle more load. In practice, however, distributed systems face unique operational challenges that impact performance, reliability, and cost efficiency.
5.1 Resharding: Redistributing Data Safely When Scaling Out
When your data grows or your traffic spikes, you may need to add more shards (machines or partitions). But how do you move terabytes of data across the cluster without downtime or data loss?
Suppose you have 4 shards handling user data distributed by user_id:
· Shard 1 → Users 1–25M
· Shard 2 → Users 25M–50M
· Shard 3 → Users 50M–75M
· Shard 4 → Users 75M–100M
Now your user base doubles to 200M. You add 4 more shards. Resharding means repartitioning the ranges (e.g., 1–12.5M, 12.5M–25M, and so on) and migrating chunks of data safely.
Challenges
· Data must move without blocking reads/writes.
· Hash-based sharding is hard to rebalance evenly.
· Range-based sharding can lead to hotspots if data isn’t evenly distributed.
How Systems Handle It
· MongoDB uses a balancer process that moves chunks in the background.
· Cassandra uses consistent hashing, which automatically redistributes data when new nodes join.
· Spanner / Vitess use online resharding techniques with dual writes and data copy. (A simplified range-split plan is sketched below.)
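To illustrate only the planning step, here is a toy Java sketch that splits each existing range in half (a two-shard layout doubled to four, for brevity). Everything here is hypothetical, and the genuinely hard parts of resharding (dual writes, backfill, cutover) are deliberately left out:

import java.util.ArrayList;
import java.util.List;

// Toy planning step for range-based resharding: split each shard's
// range in half so the shard count doubles. Data migration not shown.
public class ReshardPlan {

    record Range(long from, long to, String shard) {}

    static List<Range> splitAll(List<Range> current, int nextShardId) {
        List<Range> result = new ArrayList<>();
        for (Range r : current) {
            long mid = r.from() + (r.to() - r.from()) / 2;
            result.add(new Range(r.from(), mid, r.shard()));                  // stays put
            result.add(new Range(mid + 1, r.to(), "shard-" + nextShardId++)); // migrates
        }
        return result;
    }

    public static void main(String[] args) {
        List<Range> current = List.of(
                new Range(1, 50_000_000L, "shard-1"),
                new Range(50_000_001L, 100_000_000L, "shard-2"));
        splitAll(current, 3).forEach(System.out::println);
    }
}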
5.2 Hotspots: The "Celebrity Problem" and Uneven Data Distribution
Some data items become "celebrities": they receive disproportionate read/write traffic. Even in a well-sharded system, a single such record can overload one shard.
For example, in a social media app:
· "@elonmusk" or "@cristiano" might get millions of hits per second.
· All requests for that username land on the same shard.
Impact
· One shard becomes the bottleneck.
· Latency increases.
· The rest of the cluster sits underutilized.
Solutions
· Key Salting: Add random suffixes/prefixes (e.g., “@elonmusk_1”, “@elonmusk_2”) to distribute traffic.
· Read Replicas: Serve reads from multiple replicas instead of one.
· Caching Layers: Use Redis or CDN caching to offload frequent reads.
· Adaptive Load Balancing: Detect and redistribute "hot" keys dynamically.
Hotspots are not about lack of capacity; they're about imbalanced demand on specific partitions.
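A minimal Java sketch of key salting, the first mitigation above: writes scatter a hot key across a fixed number of buckets so its traffic spreads over several partitions, while reads fan out across all buckets and merge the results. The bucket count is an illustrative assumption:

import java.util.concurrent.ThreadLocalRandom;

// Key salting: spread one hot key across N sub-keys so its traffic lands
// on multiple partitions. Reads must query all salted variants and merge.
public class KeySalting {

    private static final int SALT_BUCKETS = 4;

    // Writes pick a random bucket, e.g. "@elonmusk_2".
    static String saltedWriteKey(String hotKey) {
        int bucket = ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
        return hotKey + "_" + bucket;
    }

    // Reads fan out over every bucket and merge the results.
    static String[] allSaltedKeys(String hotKey) {
        String[] keys = new String[SALT_BUCKETS];
        for (int i = 0; i < SALT_BUCKETS; i++) {
            keys[i] = hotKey + "_" + i;
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println("write -> " + saltedWriteKey("@elonmusk"));
        System.out.println("read  -> " + String.join(", ", allSaltedKeys("@elonmusk")));
    }
}

The trade-off is that every read now costs N lookups, so salting is best reserved for keys that are demonstrably hot.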
5.3 Smart Routing Algorithms and Adaptive Sharding Strategies
Routing each query to the correct shard efficiently becomes tricky as the number of shards grows.
For example, a router node (like mongos in MongoDB or a coordinator in Cassandra) must know where to find the document for user_id=73,459,182. With 100+ shards, a naive routing approach can become costly.
Techniques Used
· Consistent Hashing: Ensures minimal data movement when nodes are added or removed.
· Directory-Based Sharding: Maintain a metadata table that maps key ranges to the shard locations.
· Adaptive Routing: Systems like CockroachDB monitor query latency and dynamically adjust routing paths.
5.4 Balancing Traffic Across Shards Dynamically
Even with correct sharding, workloads change over time. Certain shards may become busier (due to regional traffic, new app features, or time-based events).
Example
· An e-commerce site sees spikes in one country during a festival.
· All users from that region are mapped to a specific shard.
· That shard becomes overloaded while others stay idle.
Solutions
· Auto-balancing: Monitor load and redistribute ranges or replicas dynamically.
· Dynamic Shard Splitting: Split a heavily loaded shard into smaller sub-shards.
· Traffic-aware Proxying: Route new sessions to less-loaded shards.
References
https://engineering.fb.com/2013/06/25/core-infra/tao-the-power-of-the-graph/