Programming for beginners: Sharding: Breaking Up the Big One

Sharding is a method used in databases and various systems to break down large data sets into smaller, and manageable segments. Each segment, known as a shard, is an independent database, and together these shards constitute the full database.

Advantages of Sharding

1. Improved Scalability: Sharding enables the distribution of data across multiple servers, facilitating the addition of more storage and computing power. This capability is essential for handling larger data sets and increasing workloads.

2. Better Performance: By distributing data across several nodes, the load on individual servers is reduced. This leads to quicker response times for queries and enhanced overall performance of the system.

3. Increased Availability: In a sharded system, the failure of one server or shard does not impact the entire system. The other shards continue to operate independently, which reduces downtime and the risk of data loss.

4. Easier Management: Managing smaller chunks of data is simpler than dealing with huge data sets. Tasks like maintenance and backups become more straightforward/easier.

Types of Sharding

There are primarily two sharding strategies:

1. Horizontal Sharding: In this method, data rows are distributed evenly across different shards based on certain criteria, such as user IDs, date ranges, or alphabetical order. This approach is commonly used, for example, we can shard employee data by employee ID.

2. Vertical Sharding: This technique involves dividing data columns into separate shards. Though less common, it's useful in specific situations, such as segregating frequently accessed columns from those that are seldom used.

Challenges in Sharding

Implementing sharding introduces certain complexities into the database structure. For instance, managing transactions that span multiple shards can be challenging, and maintaining data consistency across shards might be more complicated.

In summary, sharding is an effective approach to manage large data sets by dividing them into more manageable parts, but it also brings additional complexity and considerations in terms of database management and architecture.

Example 1: File sharding

Divide a big file into multiple shards based on the number of shards.

FileShardingByNoOfShards.java

package com.sample.app;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class FileShardingByNoOfShards {

	public static void shardFiles(String inputFile, int numShards, String shardFilePrefix) {

		try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
			FileWriter[] writers = new FileWriter[numShards];
			for (int i = 0; i < numShards; i++) {
				writers[i] = new FileWriter(shardFilePrefix + i + ".txt");
			}

			String line;
			int shardIndex = 0;
			while ((line = reader.readLine()) != null) {
				writers[shardIndex].write(line + "\n");
				shardIndex = (shardIndex + 1) % numShards;
			}

			for (FileWriter writer : writers) {
				writer.close();
			}

			System.out.println("File sharding completed successfully.");
		} catch (IOException e) {
			System.out.println("Error during file sharding: " + e.getMessage());
		}
	}

	public static void main(String[] args) {
		String inputFile = FileShardingByNoOfShards.class.getClassLoader().getResource("demo.txt").getPath();
		shardFiles(inputFile, 4, "shard");
	}

}

Example 2: Shard Employee data by id.

Employee.java

package com.sample.app.model;

public final class Employee {
	private final int id;
	private final String name;

	public Employee(int id, String name) {
		this.id = id;
		this.name = name;
	}

	public int getId() {
		return id;
	}

	public String getName() {
		return name;
	}

	@Override
	public String toString() {
		return "Employee{" + "id=" + id + ", name='" + name + '\'' + '}';
	}
}

EmployeeSharding.java

package com.sample.app.util;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.sample.app.model.Employee;

public final class EmployeeSharding {

	private final Map<Integer, List<Employee>> shards;
	private final int numberOfShards;

	public EmployeeSharding(int numberOfShards) {
		shards = new HashMap<>();
		this.numberOfShards = numberOfShards;
	}

	public void addEmployee(Employee employee) {
		int shardKey = employee.getId() % numberOfShards;
		shards.computeIfAbsent(shardKey, k -> new ArrayList<>()).add(employee);
	}

	public void printShards() {
		for (Integer shard : shards.keySet()) {
			System.out.println("Shard " + shard + ": " + shards.get(shard));
		}
	}

}

EmployeeShardingDemo.java

package com.sample.app;

import com.sample.app.model.Employee;
import com.sample.app.util.EmployeeSharding;

public class EmployeeShardingDemo {
	
	public static void main(String[] args) {
		EmployeeSharding sharding = new EmployeeSharding(3);

		// Adding some employees
		sharding.addEmployee(new Employee(1, "Ram"));
		sharding.addEmployee(new Employee(2, "Hari"));
		sharding.addEmployee(new Employee(3, "Krishna"));
		sharding.addEmployee(new Employee(4, "Harika"));
		sharding.addEmployee(new Employee(5, "Sailu"));

		// Print the shards
		sharding.printShards();
	}
}

Output

Shard 0: [Employee{id=3, name='Krishna'}]
Shard 1: [Employee{id=1, name='Ram'}, Employee{id=4, name='Harika'}]
Shard 2: [Employee{id=2, name='Hari'}, Employee{id=5, name='Sailu'}]

In this example:

1. EmployeeSharding class handles the logic for sharding.

2. addEmployee method adds an Employee to the appropriate shard based on their ID.

3. printShards method prints the contents of each shard.

The number of shards is specified when you add an employee. Each employee is placed in a shard based on the remainder of their ID divided by the number of shards (employee.getId() % numberOfShards). This is a simple sharding strategy and can be adjusted based on specific use cases.

System Design Questions

Programming for beginners

Monday, 8 January 2024

Sharding: Breaking Up the Big One

No comments:

Post a Comment