Tuesday, 28 May 2024

Gossip Protocol: System Failure Detection in Decentralized Distributed System

Gossip is a broadcasting protocol used for distributing the information. Instead of relying on a centralized server to host cluster information, peers share information directly with each other. In some ad-hoc networks that lack a central registry, the only way to distribute common data is for each member to pass it along to their neighbors.

 

This concept of gossip communication is similar to office workers spreading rumours. Let me explain it with an example.

 

Let me explain it with an example, where the gossip about Harika's promotion spreads among the 10 employees using a gossip protocol.

 


Initial Gossip Source

Harika gets promoted and tells Sailaja about it first. Now both Harika and Sailaja knows about Harika’s promotion.

 


First round of Gossip

Sailaja randomly selects Gopi to gossip with and tells Gopi, "Harika is promoted.



Second Round of Gossip

1.   Sailaja randomly selects Chamu to gossip with and tells Chamu, "Harika is promoted."

2.   Gopi randomly selects Raj to gossip with and tells Raj, "Harika is promoted."

 


Third Round of Gossip

1.   Chamu randomly selects David to gossip with and tells David, "Harika is promoted."

2.   Raj randomly selects Bob to gossip with and tells Bob, "Harika is promoted."

3.   Gopi randomly selects Joel to gossip with and tells Joel, "Harika is promoted."




Fourth Round of Gossip

1.   David randomly selects Alice to gossip with and tells Alice, "Harika is promoted."

2.   Bob randomly selects Ram to gossip with and tells Ram, "Harika is promoted."

3.   Raj randomly selects Sailaja to gossip with, but Sailaja already knows the gossip, so no new information is spread.

4.   Joel randomly selects Alice to gossip with, but Alice already knows the gossip.

 


At this point, everyone knows about Harika's promotion. It took four rounds for the gossip to spread to all 10 employees. Some employees heard the gossip more than once, which is common in gossip protocols. After certain rounds of gossips, all employees eventually knew about Harika's promotion.

 

How Gossip protocol helps to find system failures in decentralized distributed system

Assume I have 6 systems together formed a decentralized distributed system.



Each node maintains node membership list table, which contains node id and last heartbeat timestamp etc., Following table summarizes sample attributes stored in a Node Membership table.

 

Attribute

Description

Node Id

Unique identifier for each Node like N1, N2…

Heartbeat Counter

A counter that increments with each heartbeat sent by the node.

Timestamp

The last time the node received a heartbeat from this node. Helps in determining the latest health of the system.

Status

The current status of the node (e.g., Alive, Suspect, Failed).

Suspect Counter

A counter that increments when a node is suspected of having failed. It helps in determining the confidence level of a node's failure status.

Gossip Round

The last gossip round in which the node's information was updated.

Indirect Ping List

A list of nodes used to indirectly verify the status of a suspect node. Helps in confirming the failure of a node through other nodes.

Metadata

Additional information such as node load, resource usage, or any other relevant metrics. Useful for load balancing and resource management.

 

Scenario: Node 2 went down and how it is identified in decentralized distributed network

 

Let’s start with the Gossip process starts with the node N1. Node N1 membership table looks like below before failure detection.

 

Node Id

Heartbeat counter

Timestamp

Status

Suspect Counter

Indirect Ping list

N1

120

2024-05-28 10:00:00

Alive

0

[]

N2

118

2024-05-28 09:59:50

Alive

0

[]

N3

115

2024-05-28 10:00:00

Alive

0

[]

N4

112

2024-05-28 10:00:00

Alive

0

[]

N5

119

2024-05-28 10:00:00

Alive

0

[]

N6

116

2024-05-28 10:00:00

Alive

0

[]

 

Node N2 stops sending heartbeats. N1 detects that N2 might be down after a certain number of continuous intervals. Suppose N1 suspects that N2 is down after 3 consecutive missed heartbeats.

 

N1 starts gossiping the suspect information. During its gossip round, N1 randomly selects N3 to gossip with and shares its membership table. N1 tells N3, "I suspect N2 is down; I haven't heard from it since 2024-05-28 09:59:50."

 

N3 checks and updates its table

N3 checks its own table. If N3 also hasn’t received a heartbeat from N2, it increases its suspicion level for N2.

 

N3 might add N4 to its Indirect Ping List to verify the status of N2. N3 performs an indirect ping to N2 through N4. N3 asks N4 if it has received a recent heartbeat from N2. If N4 also confirms that it hasn't heard from N2, N3’s suspicion count for N2 increases.

 

The updated membership table of N3 looks like this:

 

Node Id

Heartbeat counter

Timestamp

Status

Suspect Counter

Indirect Ping list

N1

123

2024-05-28 10:03:00

Alive

0

[]

N2

118

2024-05-28 09:59:50

Alive

1

[N4]

N3

118

2024-05-28 10:03:00

Alive

0

[]

N4

115

2024-05-28 10:03:00

Alive

0

[]

N5

122

2024-05-28 10:03:00

Alive

0

[]

N6

119

2024-05-28 10:03:00

Alive

0

[]

 

In the next gossip round, N1 might choose N4 and share its suspicion about N2 being down. Similarly, N3 might gossip with N5 and share the same suspicion. As a result, more nodes add N2 to their Indirect Ping Lists to check its status.

 

After several gossip rounds, multiple nodes (like N1, N3, and N5) confirm that they haven't received a heartbeat from N2 for the timeout period. The suspect counter for N2 increases in each node's table, and if the suspicion continues, the nodes eventually mark N2 as failed.

 

Points to remember

1.   Periodic Gossip Rounds: Each node in the cluster initiates a gossip round at regular intervals, typically every second. This ensures that information is continuously disseminated throughout the network.

2.   Flexible Gossip Partners: Nodes have the flexibility to gossip with any other node in the cluster. This decentralized approach allows for efficient communication and spread of information across the network.

3.   Variable Gossip Targets: During a gossip round, a node can choose to gossip with one to three other nodes. This flexibility in selecting gossip targets helps in achieving a balance between spreading information and minimizing network overhead.

4.   Lack of Gossip Tracking: Nodes do not maintain a record of which nodes they have gossiped with in previous rounds. This simplifies the protocol implementation and reduces memory overhead, as nodes focus solely on exchanging information with current gossip partners.

5.   Reliable Information Propagation: Despite the decentralized and randomized nature of gossiping, the protocol reliably spreads node metadata across the cluster. Over time, this leads to a consistent and up-to-date view of the cluster's state among all nodes, facilitating effective coordination and decision-making.

 

                                                                                System Design Questions

No comments:

Post a Comment