Saturday, 25 May 2024

Mastering the Myths: 8 Fallacies of Distributed Computing

The fallacies of distributed computing address the false assumptions engineers often make about distributed systems.

Distributed System

A Distributed System is a collection of independent but interconnected computers that appear to the users of the system as a single coherent system. These computers are connected by a computer network and are equipped with distributed system software.

 


Distributed system software enables these independent computers to coordinate their activities and share computing resources to perform operations seamlessly. This software ensures that the distributed system functions as a unified entity, providing services and resources efficiently and reliably.

 

Fallacies of Distributed Computing

Following are the assumptions that most of the developers/Architecs made while building a distributed system. Original source: https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

 

1.   The network is reliable

2.   Latency is zero

3.   Bandwidth is infinite

4.   The network is secure

5.   Topology doesn't change

6.   There is one administrator

7.   Transport cost is zero

8.   The network is homogeneous

 

The 8 fallacies are like caution signs for architects and designers of distributed systems. If they believe in these assumption/fallacies, it can lead to big problems for the system and the people who built it later on.

 

1. Network is reliable

Reliability

Reliability means that the system/information/person is trustworthy, dependable, consistent and predictable. Reliability means you can trust something to work well all the time. Like, a reliable computer always starts when you want it to. Or a reliable colleague always finishes their work on time and does it really well.

 

Reliability in Computer networks

In computer networks, reliability refers to a network's ability to consistently deliver data accurately and without errors.

 

For example, imagine you're sending an important email with an attachment to your colleague. If the network is reliable, the email and attachment will reach your colleague exactly as you sent them, without any missing or errors.

 


Suppose if your network is reliable, you can expect that a movie will play smoothly without any interruptions or buffering issues. In a unreliable network the data might not receive correctly, and is a possibility of packet loss.

 

Imagine you're trying to download a file from the internet. Let's say the file is split into three parts: P1, P2 and P3. These parts are sent from the sender to a receiver over an unreliable network. Since the network is unreliable, there's a chance that some of these parts might not reach your computer correctly.

 

For example, let's say P1 and P3 make it to Receiver just fine. However, because the network is unreliable, P2 gets lost along the way. Additionally, due to the unreliability of the network, P1 might get corrupted during transmission and arrive on your computer as P1_0, with some of its content changed.

 

As a result, when you try to open the file, you'll find that it's incomplete and possibly corrupted. This can be frustrating because you won't be able to use the file as intended, and you might have to re-download it or find an alternative solution.



Why can’t we assume the network is always reliable?

Networks are complicated systems that are always dynamic in natrue and can be hard to predict. Lots of things can make a network stop working or cause problems with it. For instance, if a switch or a piece of hardware breaks, if there's a power cut, if someone sets up the network configurations wrong, if there's a mistake made by a person, if a cyber-attack happens (like a DDoS attack), if there's a bug in the software, if the internet provider has issues, or if there's suddenly a lot more traffic on the network than usual, it can cause the network to go down.

 

Following table provides real-world examples that illustrate why networks aren't always completely reliable.

 

Description

Link

This link provides information about Distributed Denial of Service (DDoS) attack on the Domain Name System (DNS). This incident affected Dyn, a major DNS provider, resulting in widespread internet disruptions. It serves as a prime example of how malicious cyber attacks can cripple network infrastructure, leading to service outages for users worldwide.

https://en.wikipedia.org/wiki/DDoS_attacks_on_Dyn

This link provides information on various internet outages, detailing specific incidents and their causes. These incidents highlight how network failures can occur due to a range of factors, including technical issues, human error, sudden spikes in the traffic and cyber threats. Examining these cases can offer insights into the complexities of maintaining a reliable network.

https://en.wikipedia.org/wiki/Internet_outage

This link offers a summary of major network outages that occurred in 2021. By reviewing these incidents, readers can gain an understanding of the types of disruptions that networks face in modern times

https://www.bmc.com/blogs/network-outages/

 

 

2. Latency is zero

Latency is the time it takes for information to travel from one point to another. Typically measured in milliseconds, it reflects the speed at which data moves across networks and communication systems. When latency is high, it signals potential issues within the network that can affect performance.

 

Latency is almost neglizable when running applications on your local computer or within a Local Area Network (LAN). However, the story changes when accessing applications hosted in distant data centers. For instance, if your application resides in an East US data center and you're accessing it from Bangalore (India), latency becomes noticeable. This delay is a natural consequence of the physical distance between the user and the server.

 

As a system designer, it's crucial to acknowledge that latency is an inherent characteristic of networks. We should never assume that data transmission will occur instantaneously or without any delay.

 

To delve deeper into latency and its implications go through below links.

a.   Azure's network round-trip latency statistics: https://learn.microsoft.com/en-us/azure/networking/azure-network-latency

b.   GCPing (Application to Measure Latency from your location to various Google Cloud regions): https://www.gcping.com/

 

3. Bandwidth is infinite

Internet bandwidth is the maximum amount of data that can be sent over your internet connection in a certain amount of time.

 

Think of it like a pipe: a wider pipe (higher bandwidth) lets more water (data) flow through at once, compared to a narrower pipe (lower bandwidth). Bandwidth is measured in Bits per Second (bps), Kilobits per Second (Kbps), Megabits per Second (Mbps), Gigabits per Second (Gbps).

 

The higher the bandwidth, the more data can be transferred in a given time.

 

Let’s compare two users, X and Y, to see how bandwidth impacts their download speeds.

 

User X: Download Bandwidth: 5 Mbps (megabits per second)

User Y: Download Bandwidth: 10 Mbps (megabits per second)

 

Imagine both users want to download a 100 megabyte (MB) file.

100 MB = 100 * 8 = 800 megabits (Mb)

 

Now, let’s calculate the time it takes for each user to download the file.

 

User X: Download Bandwidth: 5 Mbps

Time to Download = Total File Size / Download Speed = 800 Mb / 5 Mbps = 160 seconds

 

User Y: Download Bandwidth: 10 Mbps

Time to Download = Total File Size / Download Speed = 800 Mb / 10 Mbps = 80 seconds

 


Assuming infinite bandwidth while building distributed applications is unrealistic and can lead to several critical issues. Even with high bandwidth, latency (the delay in data transmission) remains a significant factor. High bandwidth does not eliminate the time it takes for data to travel across distances, which can affect application performance, especially in real-time systems. Ignorance of bandwidth limits on the part of traffic senders can result in bottlenecks.

 

Bandwidth vs speed

Bandwidth represents the potential maximum rate at which data can be transferred. Speed is the actual rate at which data is transferred, which can be influenced by various factors. For instance, if User X has a bandwidth of 5 Mbps, they should ideally experience data transfer speeds up to 5 Mbps. However, factors like network congestion can cause the user X to experience lower speeds.

 

You can use online applications such as 'https://www.speedtest.net/' to test your download and upload speeds.

 

4. The network is secure

Never assume the network is secure. Hackers are always ready with their tools, such as malware, viruses, worms, phishing, DDoS attacks, IP spoofing, Trojan horses, and packet sniffers, to gain control over the network. Always design your applications with security in mind and follow best practices (https://owasp.org/www-project-top-ten/) to minimize security risks.

 

5. Topology doesn't change

Network topology refers to how nodes within a computer network are organized. When we have multiple computers or servers, we arrange them in specific ways to enable communication between all nodes.

 

Here are different types of network topologies:

 

a.   Bus Topology

b.   Ring Topology

c.    Star Topology

d.   Mesh Topology

e.   Hybrid Topology

 

Bus Topology

In a bus topology, all nodes are connected to a single communication cable called a bus, this cable transmits signals to all devices simultaneously.

 


In a bus topology, when a device wants to send information, it broadcasts it onto the shared communication medium, which is typically a cable. Every device connected to the bus receives the signal, but only the device with the intended address processes it. This is achieved by including a unique address with each data packet, allowing devices to determine whether the transmitted data is meant for them. Suppose A want to send a packet to H, then it broadcast the packet to every other node in the bus topology.


 


Advantage

Disadvantages

Only one wire, less expensive.

 

Node failures doesn't affect other nodes

It is not fault tolerant, as there is no redundancy in the communication channel. If the transmission channel/wire failed, then entire network will go down.

 

Limited Cable length.

 

In a bus topology, since all devices share the same communication medium (the bus), only one device can transmit data at a time. If two devices attempt to send data simultaneously, a collision occurs, potentially corrupting the transmitted data. To mitigate collisions in bus topology, protocols like CSMA/CD (Carrier Sense Multiple Access with Collision Detection) are used.

 

Ring Topology

In Ring topology, all the devices are connected in a closed loop. Communication in ring topology is unidirectional in general.

 


Communication in the nodes happen via a token. Token is a special data packet called the  circulates around the ring continuously. Token will be with only one node at any point of time, whoever owns the token, they are allowed to transfer the data.

 


As you see above image, right now Device B owns the token T, so B is allowed to transfer the data in ring topology, others are restricted from transferring the data.

 

Star Topology

In a Star topology, each node links to a central node known as a hub or switch. There are no direct connections between devices; instead, all data traffic is routed through the central hub. I recommend watching the video about Hubs, Switches, and Routers for further understanding: (https://www.youtube.com/watch?v=1z0ULvg_pW8)




In a Star topology, when a node wishes to transmit data to another node, it sends the data to the central Hub/Switch. It is then the responsibility of the Hub/Switch to forward the data to the intended receiver. Unlike in Bus and Ring topologies, where data may be received by all connected devices, in a Star topology, data transmission is directed specifically to the intended recipient, eliminating the possibility of broadcasting to all nodes.

 

A serious drawback of star topology is, when the central node (Hub/Switch) fails, then the entire network is failed.

 

d. Mesh Topology

In mesh topology, each node is directly connected to other node in the network. It is fault tolerant and reliable.

 

As you see the below image, each node has multiple ways to reach the other node.

 


The downside of this topology is that it is expensive and impractical to connect huge nodes in this topology.

 

e. Hybrid Topology

Hybrid topology is a combination of more than one different topologies.


Changes in network topology can impact both bandwidth and latency, leading to similar issues. Hence, considering network topology is vital when developing a distributed application for several reasons. The selection of network topology can greatly influence the performance of your distributed application. Various topologies offer different levels of latency, bandwidth, and reliability. By choosing the right topology, you can enhance data transmission efficiency and reduce network bottlenecks.

 

Just remember that your system should be able to handle changes in network topology (how devices are connected to the network) without causing any interruptions in service. This means that even if one device stops working, the rest of the system should keep running smoothly. So, don't rely too much on any one device. Make sure your system has backups in place so it can keep working even if something goes wrong with one of the devices.

 

6. There is one administrator

When dealing with the challenges of managing distributed systems, having more than one administrator can make things more complicated. If an application is complex, it might need several administrators to handle tasks like setting rules, configuring, and keeping it running smoothly. When designing such systems, it's important to make it easy for these administrators to work together and not accidentally mess things up. You also need to make sure that changes made by one person don't cause problems for everyone else. This ensures that the system remains stable and reliable, even with multiple people involved in managing it

 

7. Transport cost is zero

Moving data from one device to another always takes time, uses up the device's resources and incurs a cost. When data moves between different parts of a system, it has to be prepared for travel and then put back together once it arrives. To make this process quicker and less resource-intensive, it's better to use lightweight formats like JSON, MessagePack, or Protocol Buffers instead of heavier options like XML. These lightweight formats help reduce the time and resources needed for data transfer.

 

8. The network is homogeneous

In general, distributed systems have to connect with lots of different devices, handle different operating systems, and work smoothly with various web browsers. They also need to communicate with other systems. That's why it's really important to make sure all these different parts can understand each other, even though they're different. One way to do this is by using open standard and widely used protocols instead of ones that are unique to a specific system. For example, protocols like HTTPS and WebSockets are good choices because they're well-supported across different platforms. The same idea applies to how data is formatted. It's often better to use standard formats like JSON because they're easier for different systems to work with.

 

In summary, the 8 Fallacies of Distributed Computing talks about the common misconceptions that can lead to problems when designing and managing distributed systems. It's very important to remember that distributed systems are inherently complex and prone to issues such as latency, network failures, and data consistency challenges. By understanding and addressing these fallacies, developers and engineers can build more resilient and efficient distributed systems.

 

References

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing



                                                                                System Design Questions

No comments:

Post a Comment