In the realm of distributed systems, efficient data distribution is crucial for performance and scalability. One fundamental approach to distributing data across multiple servers is hashing. This blog post explores a straightforward method for distributing string keys across a distributed cache using simple hashing, discusses the problems with simple hashing, and shows how consistent hashing resolves them.
Simple Hashing to distribute the data
Assume we are going to build a distributed cache application where the keys are of type String. When dealing with string keys, the hash function must first convert the string into a numerical value. We then use the modulo operation to determine the server where the data should be stored. The basic formula is:
getServer(key) = hashFunction(key) % n
Here,
a. key is the string to be hashed, and
b. n is the number of servers.
The getServer(key) method returns the server on which this key is stored.
public int getServer(String key) {
    int hashValue = hash(key);
    return Math.abs(hashValue) % noOfServers;
}
The following diagram demonstrates the same.
When a key needs to be added to a server, its server location is determined using the getServer(key) method.
The distribution of keys is as follows:
· KEY_0 is stored on Server_3
· KEY_1 is stored on Server_3
· KEY_2 is stored on Server_1
· KEY_3 is stored on Server_2
· KEY_4 is stored on Server_4
· KEY_5 is stored on Server_3
· KEY_6 is stored on Server_3
· KEY_7 is stored on Server_0
· KEY_8 is stored on Server_4
· KEY_9 is stored on Server_3
· KEY_10 is stored on Server_1
· KEY_11 is stored on Server_0
· KEY_12 is stored on Server_3
· KEY_13 is stored on Server_4
· KEY_14 is stored on Server_1
· KEY_15 is stored on Server_0
· KEY_16 is stored on Server_2
· KEY_17 is stored on Server_2
· KEY_18 is stored on Server_3
· KEY_19 is stored on Server_0
Distribution Statistics
· Server_0 contains 4 keys
· Server_1 contains 3 keys
· Server_2 contains 3 keys
· Server_3 contains 7 keys
· Server_4 contains 3 keys
Here is a demo application for the same.
SimpleHashing.java
package com.sample.app.hashing;

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleHashing {

    // A single shared digest instance; fine for this single-threaded demo
    // (MessageDigest is not thread-safe).
    private static MessageDigest MESSAGE_DIGEST = null;

    static {
        try {
            MESSAGE_DIGEST = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
    }

    private int noOfServers;

    public SimpleHashing(int noOfServers) {
        this.noOfServers = noOfServers;
    }

    // Digest the key with SHA-256 and collapse the 256-bit result to an int.
    private int hash(String key) {
        byte[] hashBytes = MESSAGE_DIGEST.digest(key.getBytes(StandardCharsets.UTF_8));
        BigInteger hashValue = new BigInteger(1, hashBytes);
        return hashValue.intValue();
    }

    // Map the key to one of the servers using the modulo operation.
    // Note: Math.abs(Integer.MIN_VALUE) is still negative; a production
    // implementation would use Math.floorMod instead.
    public int getServer(String key) {
        int hashValue = hash(key);
        return Math.abs(hashValue) % noOfServers;
    }

    public void printKeysDistribution(List<String> keys) {
        Map<Integer, Integer> stats = new HashMap<>();

        for (String key : keys) {
            int serverIndex = getServer(key);
            stats.putIfAbsent(serverIndex, 0);
            int count = stats.get(serverIndex);
            stats.put(serverIndex, count + 1);
            System.out.println(key + " is stored in Server_" + serverIndex);
        }

        System.out.println("\nDistribution statistics");
        for (int i = 0; i < noOfServers; i++) {
            System.out.println("Server_" + i + " contains " + stats.get(i) + " keys");
        }
    }
}
SimpleHashingDemo.java
package com.sample.app.hashing;

import java.util.ArrayList;
import java.util.List;

public class SimpleHashingDemo {

    public static void main(String args[]) {
        List<String> keys = new ArrayList<>();
        String keyPrefix = "KEY_";
        int noOfKeys = 20;
        int noOfServers = 5;

        // Generate the keys KEY_0 .. KEY_19.
        for (int i = 0; i < noOfKeys; i++) {
            keys.add(keyPrefix + i);
        }

        SimpleHashing simpleHashing = new SimpleHashing(noOfServers);
        simpleHashing.printKeysDistribution(keys);
    }
}
Output
KEY_0 is stored in Server_3
KEY_1 is stored in Server_3
KEY_2 is stored in Server_1
KEY_3 is stored in Server_2
KEY_4 is stored in Server_4
KEY_5 is stored in Server_3
KEY_6 is stored in Server_3
KEY_7 is stored in Server_0
KEY_8 is stored in Server_4
KEY_9 is stored in Server_3
KEY_10 is stored in Server_1
KEY_11 is stored in Server_0
KEY_12 is stored in Server_3
KEY_13 is stored in Server_4
KEY_14 is stored in Server_1
KEY_15 is stored in Server_0
KEY_16 is stored in Server_2
KEY_17 is stored in Server_2
KEY_18 is stored in Server_3
KEY_19 is stored in Server_0

Distribution statistics
Server_0 contains 4 keys
Server_1 contains 3 keys
Server_2 contains 3 keys
Server_3 contains 7 keys
Server_4 contains 3 keys
Problems with Simple Hashing
While simple hashing is an effective and straightforward method for distributing data across multiple servers, it suffers from a notable drawback when nodes fail or when new nodes are added to the cluster.
Redistribution of Data on Node Failure
One significant drawback of simple hashing is the need to redistribute data when a node goes down. Here's how it works and why it becomes a problem:
Initial Distribution: Initially, data is distributed across the servers based on the hash function, using getServer(key) = hashFunction(key) % n, where n is the number of servers.
Node Failure: Suppose one of the servers fails. This reduces the total number of servers (n). For instance, if Server_4 goes down in a system with 5 servers, n becomes 4.
Recalculation: When a node fails, all keys need to be rehashed and redistributed according to the new number of servers. This means recalculating hash(key) % 4 for all keys.
Data Movement: As a result of recalculating the hash values, most of the data will need to be moved to different servers. This is because the hash values modulo 4 will likely differ from the original values modulo 5, causing keys to map to different servers.
With Server_4 down and n = 4, the distribution of keys is as follows:
· KEY_0 is stored in Server_3
· KEY_1 is stored in Server_1
· KEY_2 is stored in Server_0
· KEY_3 is stored in Server_1
· KEY_4 is stored in Server_2
· KEY_5 is stored in Server_0
· KEY_6 is stored in Server_1
· KEY_7 is stored in Server_3
· KEY_8 is stored in Server_3
· KEY_9 is stored in Server_1
· KEY_10 is stored in Server_2
· KEY_11 is stored in Server_0
· KEY_12 is stored in Server_3
· KEY_13 is stored in Server_1
· KEY_14 is stored in Server_1
· KEY_15 is stored in Server_3
· KEY_16 is stored in Server_2
· KEY_17 is stored in Server_1
· KEY_18 is stored in Server_0
· KEY_19 is stored in Server_0
Distribution statistics
· Server_0 contains 5 keys
· Server_1 contains 7 keys
· Server_2 contains 3 keys
· Server_3 contains 5 keys
The following table summarizes the key distribution when the number of servers is 4 versus 5.
Key    | Number of Servers = 4 | Number of Servers = 5
KEY_0  | Server_3              | Server_3
KEY_1  | Server_1              | Server_3
KEY_2  | Server_0              | Server_1
KEY_3  | Server_1              | Server_2
KEY_4  | Server_2              | Server_4
KEY_5  | Server_0              | Server_3
KEY_6  | Server_1              | Server_3
KEY_7  | Server_3              | Server_0
KEY_8  | Server_3              | Server_4
KEY_9  | Server_1              | Server_3
KEY_10 | Server_2              | Server_1
KEY_11 | Server_0              | Server_0
KEY_12 | Server_3              | Server_3
KEY_13 | Server_1              | Server_4
KEY_14 | Server_1              | Server_1
KEY_15 | Server_3              | Server_0
KEY_16 | Server_2              | Server_2
KEY_17 | Server_1              | Server_2
KEY_18 | Server_0              | Server_3
KEY_19 | Server_0              | Server_0
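To quantify this churn in code, here is a minimal sketch (RemapCountDemo is a hypothetical helper class, not part of the original listings, reusing the SimpleHashing class above) that counts how many keys land on a different server when the cluster shrinks from 5 to 4 servers.
RemapCountDemo.java
package com.sample.app.hashing;

import java.util.ArrayList;
import java.util.List;

public class RemapCountDemo {

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            keys.add("KEY_" + i);
        }

        // Hash the same keys against a 5-server and a 4-server cluster.
        SimpleHashing fiveServers = new SimpleHashing(5);
        SimpleHashing fourServers = new SimpleHashing(4);

        int moved = 0;
        for (String key : keys) {
            // The key must be migrated if modulo 4 places it on a
            // different server than modulo 5 did.
            if (fiveServers.getServer(key) != fourServers.getServer(key)) {
                moved++;
            }
        }
        System.out.println(moved + " of " + keys.size() + " keys must be moved");
    }
}
Per the table above, this reports that 14 of the 20 keys must be moved; only 6 keys (KEY_0, KEY_11, KEY_12, KEY_14, KEY_16, and KEY_19) stay on the same server.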
The same problem occurs when we add new servers to an existing cluster. This problem is addressed by consistent hashing.
Consistent Hashing
As per Wikipedia: in computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only n/m keys need to be remapped on average, where
a. n is the number of keys, and
b. m is the number of slots.
In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped, because the mapping between the keys and the slots is defined by a modular operation. For example, with n = 20 keys and m = 5 servers, consistent hashing remaps only about 20/5 = 4 keys when a server is added or removed, whereas we saw above that modulo-based hashing moved 14 of the 20 keys.
Let me explain this with an example.
Suppose you have 4 servers (Server_1, Server_2, Server_3, and Server_4) and you want to distribute all the keys among these four servers.
Step 1: Calculate the hash codes for all these servers. The hash code is generated using the server name, IP address, or any other unique identifier for each server.
The following table summarizes the hash codes.
Server   | HashCode
Server_1 | -24244935
Server_2 | 813726879
Server_3 | -937748046
Server_4 | 397230681
Let's arrange these hash codes on a circular ring.
Arrange Hash Codes in Ascending Order
-937748046 (Server_3)
-24244935 (Server_1)
397230681 (Server_4)
813726879 (Server_2)
The arrangement on the circular ring:
Server_3 (-937748046) --> Server_1 (-24244935) --> Server_4 (397230681) --> Server_2 (813726879)
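As a minimal sketch (hypothetical, separate from the full application later in this post), this ring can be represented with a Java TreeMap, whose ascending key order corresponds exactly to the clockwise order on the ring.
RingSketch.java
package com.sample.app.hashing;

import java.util.TreeMap;

public class RingSketch {

    public static void main(String[] args) {
        // Server hash code -> server name; TreeMap keeps entries sorted,
        // so iterating the map walks the ring clockwise.
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(-937748046, "Server_3");
        ring.put(-24244935, "Server_1");
        ring.put(397230681, "Server_4");
        ring.put(813726879, "Server_2");

        ring.forEach((hash, server) -> System.out.println(hash + " -> " + server));
    }
}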
Suppose we want to distribute keys KEY_0 to KEY_9 among the servers using consistent hashing. We must compute the hash codes for these keys and compare them with the hash codes of our servers to determine the appropriate server for each key.
The following table summarizes the hash codes of the keys.
Key   | HashCode
KEY_0 | 145963125
KEY_1 | 680568006
KEY_2 | 560725153
KEY_3 | -2048590292
KEY_4 | 481448715
KEY_5 | -707126076
KEY_6 | 634733290
KEY_7 | -684746024
KEY_8 | -976564317
KEY_9 | 158285436
Let's find the server each key will be assigned to:
KEY_0 (145963125)
· 145963125 is greater than -24244935 (Server_1) and less than 397230681 (Server_4).
· So, KEY_0 is assigned to Server_4.
KEY_1 (680568006)
· 680568006 is greater than 397230681 (Server_4) and less than 813726879 (Server_2).
· So, KEY_1 is assigned to Server_2.
KEY_2 (560725153)
· 560725153 is greater than 397230681 (Server_4) and less than 813726879 (Server_2).
· So, KEY_2 is assigned to Server_2.
KEY_3 (-2048590292)
· -2048590292 is less than -937748046 (Server_3), the smallest server hash code, so the first server clockwise from it is Server_3.
· So, KEY_3 is assigned to Server_3.
KEY_4 (481448715)
· 481448715 is greater than 397230681 (Server_4) and less than 813726879 (Server_2).
· So, KEY_4 is assigned to Server_2.
KEY_5 (-707126076)
· -707126076 is greater than -937748046 (Server_3) and less than -24244935 (Server_1).
· So, KEY_5 is assigned to Server_1.
KEY_6 (634733290)
· 634733290 is greater than 397230681 (Server_4) and less than 813726879 (Server_2).
· So, KEY_6 is assigned to Server_2.
KEY_7 (-684746024)
· -684746024 is greater than -937748046 (Server_3) and less than -24244935 (Server_1).
· So, KEY_7 is assigned to Server_1.
KEY_8 (-976564317)
· -976564317 is less than -937748046 (Server_3), the smallest server hash code, so the first server clockwise from it is Server_3.
· So, KEY_8 is assigned to Server_3.
KEY_9 (158285436)
· 158285436 is greater than -24244935 (Server_1) and less than 397230681 (Server_4).
· So, KEY_9 is assigned to Server_4.
Summary of Key Assignments:
Server_3:
· KEY_3 (-2048590292)
· KEY_8 (-976564317)
Server_1:
· KEY_5 (-707126076)
· KEY_7 (-684746024)
Server_4:
· KEY_0 (145963125)
· KEY_9 (158285436)
Server_2:
· KEY_1 (680568006)
· KEY_2 (560725153)
· KEY_4 (481448715)
· KEY_6 (634733290)
This arrangement ensures that each key is assigned to the nearest server in the clockwise direction on the hash ring, minimizing the amount of key movement required when servers are added or removed.
How to identify the server for a given key?
To identify the server where a specific key (in this case, KEY_5) persists in a consistent hashing setup, follow these steps:
1. Calculate the Hash Code of the Key: First, you need the hash code of the key. For KEY_5, this is given as -707126076.
2. Locate the Position in the Hash Ring: Next, you find the position of this hash code in the sorted order of the server hash codes on the hash ring.
3. Find the Nearest Server Clockwise: Starting from the calculated hash code of the key, move clockwise on the ring to find the first server with a hash code greater than or equal to the key’s hash code. If no such server exists, wrap around to the first server in the sorted order.
Steps to Identify the Server for KEY_5:
1. Arrange Server Hash Codes in Ascending Order:
· -937748046 (Server_3)
· -24244935 (Server_1)
· 397230681 (Server_4)
· 813726879 (Server_2)
2. Determine Position of KEY_5 Hash Code:
· The hash code for KEY_5 is -707126076.
· Locate the position of -707126076 in the sorted list of server hash codes.
3. Find the Nearest Server Clockwise:
· -707126076 is greater than -937748046 (Server_3) and less than -24244935 (Server_1).
· Thus, the nearest server in the clockwise direction is Server_1, with a hash code of -24244935.
Therefore, KEY_5 with hash code -707126076 will persist in Server_1.
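This clockwise search maps naturally onto TreeMap's ceilingEntry operation. Below is a minimal sketch (hypothetical, hard-coding the hash codes from the tables above) of the lookup for KEY_5; the full application that follows uses the equivalent tailMap approach.
LookupSketch.java
package com.sample.app.hashing;

import java.util.Map;
import java.util.TreeMap;

public class LookupSketch {

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(-937748046, "Server_3");
        ring.put(-24244935, "Server_1");
        ring.put(397230681, "Server_4");
        ring.put(813726879, "Server_2");

        int keyHash = -707126076; // hash code of KEY_5

        // ceilingEntry returns the first server at or after the key's
        // position on the ring; a null result means we passed the largest
        // hash code, so wrap around to the first server.
        Map.Entry<Integer, String> entry = ring.ceilingEntry(keyHash);
        if (entry == null) {
            entry = ring.firstEntry();
        }
        System.out.println("KEY_5 is assigned to " + entry.getValue()); // Server_1
    }
}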
The complete working application is shown below.
ConsistentHashing.java
package com.sample.app.hashing;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashing {

    // The hash ring: server hash code -> server name, kept in ascending order.
    private final SortedMap<Integer, String> circle = new TreeMap<>();

    public ConsistentHashing(String[] servers) {
        for (String server : servers) {
            addServer(server);
        }
    }

    private int hash(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hashBytes = md.digest(key.getBytes(StandardCharsets.UTF_8));
            return new String(hashBytes, StandardCharsets.UTF_8).hashCode();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    private void addServer(String server) {
        circle.put(hash(server), server);
    }

    public void printServer(String key) {
        if (circle.isEmpty()) {
            return;
        }
        int hash = hash(key);
        int keyHash = hash;
        if (!circle.containsKey(hash)) {
            // Take the first server clockwise from the key's position;
            // if there is none, wrap around to the first server on the ring.
            SortedMap<Integer, String> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        String server = circle.get(hash);
        System.out.println("Key " + key + " and hashCode : " + keyHash + " is assigned to " + server);
    }

    public void printRing() {
        for (Integer key : circle.keySet()) {
            System.out.println(key + " " + circle.get(key));
        }
    }
}
ConsistentHashingDemo.java
package com.sample.app.hashing;

public class ConsistentHashingDemo {

    public static void main(String[] args) {
        String[] servers = { "Server_1", "Server_2", "Server_3", "Server_4" };

        ConsistentHashing consistentHashing = new ConsistentHashing(servers);
        consistentHashing.printRing();
        System.out.println();

        String keyPrefix = "KEY_";
        for (int i = 0; i < 10; i++) {
            consistentHashing.printServer(keyPrefix + i);
        }
    }
}
Output
-937748046 Server_3
-24244935 Server_1
397230681 Server_4
813726879 Server_2

Key KEY_0 and hashCode : 145963125 is assigned to Server_4
Key KEY_1 and hashCode : 680568006 is assigned to Server_2
Key KEY_2 and hashCode : 560725153 is assigned to Server_2
Key KEY_3 and hashCode : -2048590292 is assigned to Server_3
Key KEY_4 and hashCode : 481448715 is assigned to Server_2
Key KEY_5 and hashCode : -707126076 is assigned to Server_1
Key KEY_6 and hashCode : 634733290 is assigned to Server_2
Key KEY_7 and hashCode : -684746024 is assigned to Server_1
Key KEY_8 and hashCode : -976564317 is assigned to Server_3
Key KEY_9 and hashCode : 158285436 is assigned to Server_4
Advantage of Consistent Hashing When a Server Goes Down
Consistent hashing provides significant advantages in distributed systems, particularly when servers are dynamically added or removed. Let's discuss the specific scenario where Server_1 goes down and how consistent hashing handles this situation effectively:
From our previous assignment, we know that Server_1 stores KEY_5 and KEY_7.
Consistent Hashing Behaviour when Server_1 goes down
When Server_1 goes down, consistent hashing ensures that only a minimal number of keys need to be reassigned to the remaining servers. This minimizes disruption and the need for data transfer, unlike other hashing schemes like simple modulo-based hashing where all keys might need to be redistributed.
1. Identify the keys on Server_1: the keys that were assigned to Server_1 are KEY_5 and KEY_7.
2. Reassign the affected keys: move these keys to the next available server in the clockwise direction.
Key Reassignment
KEY_5 (-707126076)
· Previously mapped to Server_1 (-24244935)
· Now, find the next server in the clockwise direction:
· Nearest server hash code greater than -707126076: Server_4 (397230681)
· Reassigned to Server_4
KEY_7 (-684746024)
· Previously mapped to Server_1 (-24244935)
· Now, find the next server in the clockwise direction:
· Nearest server hash code greater than -684746024: Server_4 (397230681)
· Reassigned to Server_4
Updated Key Distribution
After Server_1 goes down and the affected keys are reassigned, the new distribution is as follows:
Server_3:
· KEY_3 (-2048590292)
· KEY_8 (-976564317)
Server_4:
· KEY_0 (145963125)
· KEY_5 (-707126076) (Reassigned)
· KEY_7 (-684746024) (Reassigned)
· KEY_9 (158285436)
Server_2:
· KEY_1 (680568006)
· KEY_2 (560725153)
· KEY_4 (481448715)
· KEY_6 (634733290)
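The ConsistentHashing class shown earlier has no removal method. Below is a minimal sketch of one (a hypothetical addition, not in the original class): because lookups always walk clockwise from the key's position, deleting the server's single point on the ring is all that is needed.
// Hypothetical addition to the ConsistentHashing class:
public void removeServer(String server) {
    // Removing the server's point means subsequent lookups for its keys
    // fall through to the next server clockwise (e.g. KEY_5 and KEY_7
    // would then resolve to Server_4, matching the reassignment above).
    circle.remove(hash(server));
}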
Advantage of Consistent Hashing When a New Server Is Added
When a new server, "New_Server" with hash code -53574582, is added to the consistent hashing ring, consistent hashing ensures that only a minimal number of keys need to be reassigned. This minimizes disruption and ensures an even distribution of keys across the servers.
Given Servers and Hash Codes:
· Server_3: -937748046
· Server_1: -24244935
· Server_4: 397230681
· Server_2: 813726879
New Server Added:
· New_Server: -53574582
Any key with a hash code greater than -937748046 (Server_3) and at most -53574582 (New_Server) will now be assigned to New_Server instead of Server_1. As per this logic, KEY_5 (-707126076) and KEY_7 (-684746024) will be reassigned to New_Server (a usage sketch follows the new distribution below).
New Distribution:
Server_3:
· KEY_3 (-2048590292)
· KEY_8 (-976564317)
Server_1:
· Previously held KEY_5 and KEY_7; both are now reassigned to New_Server
Server_4:
· KEY_0 (145963125)
· KEY_9 (158285436)
Server_2:
· KEY_1 (680568006)
· KEY_2 (560725153)
· KEY_4 (481448715)
· KEY_6 (634733290)
New_Server:
· KEY_5 (-707126076)
· KEY_7 (-684746024)
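To reproduce this scenario with the demo classes, addServer in ConsistentHashing would need to be made public; the following is a hedged usage sketch under that assumption (and assuming "New_Server" hashes to -53574582, as stated above).
AddServerDemo.java
package com.sample.app.hashing;

public class AddServerDemo {

    public static void main(String[] args) {
        String[] servers = { "Server_1", "Server_2", "Server_3", "Server_4" };
        ConsistentHashing consistentHashing = new ConsistentHashing(servers);

        // Assumes addServer(String) has been made public.
        consistentHashing.addServer("New_Server");

        // Only the keys between Server_3 and New_Server move;
        // per the hash codes above, those are KEY_5 and KEY_7.
        consistentHashing.printServer("KEY_5");
        consistentHashing.printServer("KEY_7");
    }
}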
Virtual Servers in Consistent Hashing
Virtual servers help achieve more even distribution of keys and better load balancing. Without virtual servers, the hash values might cluster around certain servers, causing them to handle a disproportionate amount of traffic. By assigning multiple virtual servers to each physical server, we can smooth out these imbalances.
Consider what happens without virtual servers: if Server_2 goes down, all keys that were assigned to Server_2 will now go to Server_3, the next server in the clockwise direction (wrapping around the ring). Server_3 suddenly takes on the entire load previously handled by Server_2.
Scenario with Virtual Servers
Let's assume each server is represented by three virtual servers.
Server_1
· Server_1.1
· Server_1.2
· Server_1.3
Server_2
· Server_2.1
· Server_2.2
· Server_2.3
Server_3
· Server_3.1
· Server_3.2
· Server_3.3
Server_4
· Server_4.1
· Server_4.2
· Server_4.3
Handling Large Gaps and Load Balancing:
Virtual servers ensure that the load is more evenly distributed. In this case, even if Server_2 goes down, its load is not transferred to a single server but is distributed to multiple servers (for example, to virtual nodes such as Server_1.2, Server_3.2, and Server_4.1, depending on where Server_2's virtual nodes fall on the ring). This prevents any single server from becoming a bottleneck. A sketch of how virtual nodes can be placed on the ring follows.
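Here is a minimal sketch of how virtual nodes could be added to the ConsistentHashing class (hypothetical: the replica count of 3 and the "." suffix naming follow the example above).
// Hypothetical variant of addServer with virtual nodes:
private static final int VIRTUAL_NODES = 3;

private void addServer(String server) {
    // Each physical server is hashed onto the ring several times under
    // distinct names (e.g. Server_2.1, Server_2.2, Server_2.3), so its
    // keys, and on failure its load, spread across several ring segments.
    for (int i = 1; i <= VIRTUAL_NODES; i++) {
        circle.put(hash(server + "." + i), server);
    }
}
Lookups are unchanged, because the map value is still the physical server name; removing a server must then delete all of its virtual points.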
Benefits of Virtual Servers:
1. More Even Load Distribution: Virtual servers spread the load across multiple points on the hash ring, reducing the likelihood of any single server being overwhelmed.
2. Better Fault Tolerance: When a server goes down, the load it was handling is distributed among multiple servers, maintaining system performance and reliability.
3. Scalability: Adding or removing servers can be done smoothly with minimal disruption, as the impact is distributed across multiple virtual nodes.
In summary, virtual servers in consistent hashing help achieve a more balanced distribution of keys, improving load balancing and fault tolerance in the system. When a server goes down, the load is evenly distributed among the remaining servers, ensuring the system continues to operate efficiently.