Wednesday 29 November 2023

Measuring System Reliability: A Comprehensive Guide

 

Reliability is how consistently a system or service does its job correctly over time. A reliable system can be trusted to behave as intended, even when conditions are complex.

 

Different Ways to Measure Reliability

1. Mean Time Between Failures (MTBF)

MTBF is the average time that passes between one failure and the next in a repairable system. A higher MTBF means the system is more reliable, because it runs for longer without failing.

 

Formula

MTBF = Total Time the System Works / Number of Times it Breaks

 

For example, let's say there's a service running for a total of 10 hours, and during that time, it stops working 5 times. To find the MTBF for this service:

 

MTBF = 10 hours / 5 failures = 2 hours
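The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name `mtbf` and its parameter names are made up for this example.

```python
def mtbf(total_operating_hours: float, failure_count: int) -> float:
    """Mean time between failures, in hours (hypothetical helper)."""
    if failure_count <= 0:
        # With no recorded failures the ratio is undefined.
        raise ValueError("MTBF needs at least one failure")
    return total_operating_hours / failure_count


# The example from the text: 10 hours of operation, 5 failures.
print(mtbf(10, 5))  # 2.0
```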

 

2. Mean Time To Repair (MTTR)

Mean Time To Repair (MTTR) is an important metric when working with microservices. It is the average time it takes to fix a microservice that is not working properly.

 

Formula

MTTR = Total Time Spent Fixing / Number of Failures

 

Where,

a.   Total Time Spent Fixing: the total time spent repairing all the failures that occurred during the observation period.

b.   Number of Failures: how many times the microservice failed to work as expected during that period.

 

For instance, if the total repair time is 10 hours and there were 5 failures, then

 

MTTR = 10 hours / 5 failures = 2 hours

 

So, on average, it takes 2 hours to fix a microservice when it's not working right.
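In practice the total repair time usually comes from a list of individual incidents. Here is a minimal sketch, assuming each incident's repair duration is already known in hours. The function name `mttr` and the five durations are invented for illustration; they were picked so that they add up to the 10 hours used above.

```python
def mttr(repair_hours: list[float]) -> float:
    """Mean time to repair, in hours (hypothetical helper)."""
    if not repair_hours:
        # No incidents means there is nothing to average.
        raise ValueError("MTTR needs at least one incident")
    return sum(repair_hours) / len(repair_hours)


# Made-up repair durations for 5 incidents, 10 hours in total.
incident_repair_hours = [1.5, 2.5, 3.0, 1.0, 2.0]
print(mttr(incident_repair_hours))  # 2.0
```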

 

3. Availability

Availability is a measure of how often a system or component is operational and accessible to users. We usually talk about it as a percentage, where higher percentages mean the system is available more.

 

For instance, if a system has 99.5% availability, it is expected to be working 99.5% of the time and to be down for only the remaining 0.5%.

 

Formula

Availability = (Total Uptime / Total Time) * 100%

 

where:

 

a.   Total Uptime is the total amount of time the system or component was operational.

b.   Total Time is the full observation period, including both uptime and downtime.

 

For example, if the total time is 1000 hours and the total uptime is 999 hours, then

Availability = (999 / 1000) * 100% = 99.9%
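As a rough sketch, the availability formula translates to Python like this. The function name `availability_percent` and the input checks are assumptions added for this example, not part of any standard library.

```python
def availability_percent(uptime_hours: float, total_hours: float) -> float:
    """Availability as a percentage of the observed period (hypothetical helper)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= uptime_hours <= total_hours:
        raise ValueError("uptime_hours must be between 0 and total_hours")
    return 100.0 * uptime_hours / total_hours


# The example from the text: 999 hours up out of 1000 hours observed.
print(f"{availability_percent(999, 1000):.1f}%")  # 99.9%
```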

 

4. Failure Rate

Failure rate is the number of failures that occur in a system or component per unit of time, such as failures per hour. A higher failure rate means there's a higher chance of things not working.

 

Formula

Failure Rate = (Number of Failures / Total Exposure Time)

 

where Total Exposure Time is the total time the system was in operation and under observation.

For example, if a system has 10 failures in 100 hours, then the failure rate is:

 

Failure Rate = (10 / 100 hours) = 0.1 failures/hour
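Here is the same calculation as a small Python sketch. The function name `failure_rate` is made up for illustration, and the result is in failures per hour because the exposure time is given in hours.

```python
def failure_rate(failure_count: int, exposure_hours: float) -> float:
    """Failures per hour of exposure (hypothetical helper)."""
    if exposure_hours <= 0:
        raise ValueError("exposure_hours must be positive")
    return failure_count / exposure_hours


# The example from the text: 10 failures in 100 hours.
print(failure_rate(10, 100))  # 0.1
```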

 

5. Reliability Block Diagrams (RBD)

Think of a big system with lots of connected parts, like microservices. Each part has its own reliability. An RBD is used to model the flow of success or failure through the system and to calculate the reliability of the system as a whole.

 

How an RBD Works

An RBD is drawn as a set of connected blocks, where each block represents a component or subsystem of the system. The blocks are connected in a series or parallel configuration. A series connection indicates that all blocks must be in the success state for the system to be in the success state. A parallel connection indicates that only one block needs to be in the success state for the system to be in the success state.

 

Series Configuration

In a series configuration, the system works only if every component in the chain works. Assuming the components fail independently, the overall reliability (R_sys) is the product of the individual component reliabilities: R_sys = R1 * R2 * ... * Rn.

 

Example

Think about a payment system with three parts (Authentication and Authorization Service, Payment Service, Database) in a series.

 


Authentication and Authorization Service reliability R1 =  0.98

Payment service reliability R2 =  0.94

Database reliability R3 = 0.99

Total reliability of the system = R1 * R2 * R3 = 0.98 * 0.94 * 0.99 ≈ 0.9120

So, the system reliability in this series configuration is about 0.9120, or 91.2%, which is lower than the reliability of its weakest part (0.94).
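To check the arithmetic for the payment system, here is a minimal Python sketch of the series calculation. The function name `series_reliability` is invented for this example, and it assumes the components fail independently of each other. The exact product is 0.911988, which rounds to 0.9120.

```python
import math


def series_reliability(component_reliabilities: list[float]) -> float:
    """Reliability of components in series: all of them must work."""
    for r in component_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must be between 0 and 1")
    return math.prod(component_reliabilities)


# Auth service, payment service, database (values from the example above).
r_sys = series_reliability([0.98, 0.94, 0.99])
print(f"{r_sys:.4f}")  # 0.9120
```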

 

Parallel Configuration

In a parallel configuration, the system is operational if at least one of the components is operational. The overall reliability (R_sys) is the complement of the probability that all components fail simultaneously.

 


 

If the reliability of service 1 is R1 = 0.99 and the reliability of service 2 is R2 = 0.95, the overall system reliability (R_sys) is given by:

R_sys = 1 - (1 - R1) * (1 - R2)

      = 1 - (1 - 0.99) * (1 - 0.95)

      = 1 - (0.01 * 0.05)

      = 1 - 0.0005

      = 0.9995

 

So the reliability of the system is 99.95%, which is higher than that of either service on its own.
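The parallel case can be sketched the same way. The function name `parallel_reliability` is hypothetical, and the sketch again assumes the services fail independently, so the chance that all of them fail is the product of their individual failure probabilities.

```python
import math


def parallel_reliability(component_reliabilities: list[float]) -> float:
    """Reliability of redundant components: at least one must work."""
    for r in component_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must be between 0 and 1")
    # Probability that every component fails at the same time.
    all_fail = math.prod(1.0 - r for r in component_reliabilities)
    return 1.0 - all_fail


# Service 1 and service 2 from the example above.
r_sys = parallel_reliability([0.99, 0.95])
print(f"{r_sys:.4f}")  # 0.9995
```

Adding a third redundant service to the list would push the result even closer to 1, which is why redundancy is a common way to raise reliability.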

 

6. Service Level Agreements (SLAs)

An SLA is a contract between a service provider and a customer that defines the level of service the provider commits to deliver. SLAs can be used as a reliability metric to some extent: they provide a framework for defining and measuring the performance of a service, which can then be used to assess its reliability.
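One common way to make an SLA measurable is to state an availability target and work out how much downtime it allows. The sketch below assumes an SLA that promises a given availability percentage over a fixed period; the function name `allowed_downtime_minutes`, the 99.9% target, and the 30-day period are all illustrative assumptions, not terms taken from any real contract.

```python
def allowed_downtime_minutes(target_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target (hypothetical helper)."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - target_percent) / 100.0


# A 99.9% target over a 30-day period leaves about 43 minutes of downtime.
print(f"{allowed_downtime_minutes(99.9):.1f} minutes")  # 43.2 minutes
```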


