Circuit Breaker is a design pattern that helps prevent cascading failures and lets an application handle errors gracefully.
Cascading failure
A cascading failure is a scenario where a failure in one part of the system triggers a sequence of failures in other parts of the system.
Let me explain this with an e-commerce application. Assume the e-commerce application has the following microservices:
a. User Service
This service manages all user-related information. It handles details such as the user's name, multiple delivery addresses, phone numbers, email addresses, and other personal information. The User Service ensures that all necessary user data is available for other services that need it.
b. Product Service
The Product Service provides comprehensive details about the various products available in the catalogue. This includes product descriptions, prices, availability, and other relevant information. It serves as the central repository for product data used across the application.
c. Coupon Service
The Coupon Service manages the discount coupons. Users can apply these coupons to receive discounts on their purchases. Since coupon discounts are targeted at specific users and products, this service depends on both the User Service (to identify eligible users) and the Product Service (to apply discounts to specific products).
d. Order Service
The Order Service is responsible for handling the entire order processing workflow. It relies on multiple other services:
1. User Service: For retrieving user-related information such as delivery addresses.
2. Product Service: For accessing detailed information about the products included in an order.
3. Coupon Service: For applying any applicable coupons or discounts to the order.
e. Payment Service
The Payment Service processes payments for orders. It depends on the Order Service to obtain the details of the order that needs to be paid for. This includes the final amount after any discounts have been applied, as well as user and order information necessary for processing the payment.
Let's look at how a failure in the Product Service cascades to all the other services. Assume that the Product Service begins to experience high latency due to a sudden spike in traffic, a bug in the code, or network latency when fetching product details from the database or cache.
Impact on Order Service
The Order Service, which relies on the Product Service to fetch product details, starts experiencing delays and timeouts. As a result, orders begin to fail because the Order Service cannot retrieve the necessary product information in time.
Impact on Coupon Service
The Coupon Service, which also depends on the Product Service for product details, faces similar delays and timeouts. Consequently, coupon applications begin to fail as the Product Service cannot provide the required information in a timely manner.
Overload on User Service
Users get frustrated and start retrying their actions, such as placing orders, applying coupons, and reloading pages repeatedly. This leads to a significant increase in traffic to the User Service. The User Service, which was previously operating smoothly, now experiences a heavy load, resulting in performance degradation or failures.
The initial failure in the Product Service starts affecting the Order Service and Coupon Service. This chain reaction continues to impact other services, eventually leading to a system-wide outage. What began as a problem in the Product Service has now escalated into a complete failure of the e-commerce platform.
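One way to address this is to provide a fallback response. Let me explain this with an example that uses Netflix Hystrix: the OrderService wraps its call to the ProductService in a HystrixCommand, and getFallback() supplies the response whenever that call fails.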
ProductService.java
package com.sample.app.services;

public class ProductService {

    public String callService() {
        // Simulate a failure in roughly 30% of the calls
        if (Math.random() > 0.7) {
            throw new RuntimeException("ProductService is down");
        }
        return "ProductService response";
    }
}
OrderService.java
package com.sample.app.services;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class OrderService {

    private final ProductService productService = new ProductService();

    public String getProductDetails() {
        // A HystrixCommand instance can be executed only once, so create a new one per call
        return new ProductServiceCommand(productService).execute();
    }

    // Wraps the call to ProductService and supplies a fallback when it fails
    private class ProductServiceCommand extends HystrixCommand<String> {

        private final ProductService prodService;

        protected ProductServiceCommand(ProductService prodService) {
            super(HystrixCommandGroupKey.Factory.asKey("ProductService"));
            this.prodService = prodService;
        }

        @Override
        protected String run() throws Exception {
            return prodService.callService();
        }

        // Called by Hystrix when run() throws, times out, or is rejected
        @Override
        protected String getFallback() {
            return "Fallback response";
        }
    }
}
FallbackResponse.java
package com.sample.app;

import com.sample.app.services.OrderService;

public class FallbackResponse {

    public static void main(String[] args) {
        OrderService orderService = new OrderService();

        for (int i = 0; i < 10; i++) {
            System.out.println(orderService.getProductDetails());
        }
    }
}
Sample Output
ProductService response
Fallback response
Fallback response
ProductService response
ProductService response
ProductService response
ProductService response
Fallback response
ProductService response
ProductService response
How can a fallback response be implemented?
a. Queue the orders and process them asynchronously: If the Product Service isn't working, we can save the orders in a queue and tell users that there is a problem and that we will notify them once it is fixed. When the Product Service works again, we start processing the queued orders. Once an order has been processed, we notify the user by email or text message, so the confirmation is not immediate.
b. Retry with an exponential delay strategy: Implement a retry mechanism with an exponential delay strategy. This means that if the ProductService is unavailable, you wait for a short period before retrying the call. If the retry fails, you wait for a longer period before retrying again. This approach reduces the load on the ProductService during downtime and increases the chances of successful calls when it becomes available.
If we keep retrying without increasing the wait time on every attempt, each retry generates additional requests to the ProductService. If the ProductService is already experiencing high latency or downtime, these additional requests can overload it further and make the situation worse. The increased load from retries can exhaust resources on the ProductService's servers, such as CPU, memory, and network bandwidth. Eventually, the servers may become unresponsive or crash, which prolongs the downtime.
c. Serve from the cache if the information is available: Cache product information locally within the Order Service. If the ProductService is unavailable, you can use the cached product information to fulfill orders. However, be careful about data consistency and cache expiration when serving cached data.
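Options b and c can be combined. The following is a minimal sketch in plain Java, not taken from the application above and not a Hystrix API: the class ResilientProductLookup, its constructor parameters, and the Function that stands in for the ProductService call are all hypothetical names chosen for illustration. It retries with a doubling delay and, once the attempts are used up, serves the last good response from a local cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper, not part of Hystrix or of the application above.
class ResilientProductLookup {
    private final Function<String, String> remoteCall; // stand-in for the ProductService call
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final int maxAttempts;
    private final long baseDelayMillis;
    private final long maxDelayMillis;

    ResilientProductLookup(Function<String, String> remoteCall, int maxAttempts,
            long baseDelayMillis, long maxDelayMillis) {
        this.remoteCall = remoteCall;
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Delay before the next retry: base * 2^attempt, capped at maxDelayMillis.
    static long backoffDelay(int attempt, long baseDelayMillis, long maxDelayMillis) {
        long delay = baseDelayMillis << Math.min(attempt, 20); // bound the shift to avoid overflow
        return Math.min(delay, maxDelayMillis);
    }

    String getProduct(String productId) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                String fresh = remoteCall.apply(productId);
                cache.put(productId, fresh); // remember the last good response
                return fresh;
            } catch (RuntimeException e) {
                boolean lastAttempt = attempt == maxAttempts - 1;
                if (!lastAttempt && !sleep(backoffDelay(attempt, baseDelayMillis, maxDelayMillis))) {
                    break; // interrupted while waiting: stop retrying
                }
            }
        }
        // All attempts failed: serve possibly stale data, or a generic fallback.
        return cache.getOrDefault(productId, "Fallback response");
    }

    private static boolean sleep(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ResilientProductLookup lookup = new ResilientProductLookup(
                id -> { throw new RuntimeException("ProductService is down"); }, 3, 100, 1000);
        // Waits 100 ms, then 200 ms between the three attempts, then falls back.
        System.out.println(lookup.getProduct("p1"));
    }
}
```

In a real system you would usually also add some random jitter to the delay so that many clients do not retry at the same instant; that is left out here to keep the sketch short.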
Let's rewrite the application using the circuit breaker capabilities of Hystrix.
Circuit breaker states
The Circuit Breaker pattern has three states:
a. CLOSED
b. OPEN
c. HALF-OPEN
CLOSED state
In the "CLOSED" state, everything is operating as expected; this is the happy path. All microservices are operational, and the circuit breaker lets requests pass through to the service. The Order Service initiates a call, which is routed through the circuit breaker. The circuit breaker forwards the call to the Product Service and returns its response.
OPEN state
In the scenario involving the Order Service and Product Service, when the share of calls to the Product Service that time out or fail crosses a preconfigured threshold, it indicates that the Product Service may be slow or not functioning as expected. This causes the Circuit Breaker to trip, transitioning it to the "OPEN" state.
In the "OPEN" state, calls from the Order Service to the Product Service are not executed. The circuit breaker fails fast and returns an error or a fallback response instead. This prevents further load on the Product Service, reducing the strain on its resources and giving it time to recover from the failure or slowness.
HALF-OPEN state
In the context of the Order and Product services, after a certain duration, the Circuit Breaker transitions to the "HALF-OPEN" state. In this state, the Circuit Breaker lets a trial request through to the Product Service to check whether it has recovered from the previous failure. If the trial request succeeds, the Circuit Breaker goes back to the "CLOSED" state. If it still results in a timeout or failure, the Product Service has not recovered, and the Circuit Breaker returns to the "OPEN" state for another waiting period.
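To make the three states concrete, here is a minimal hand-rolled sketch in plain Java. It is not how Hystrix is implemented: the class SimpleCircuitBreaker, its failureThreshold (a count of consecutive failures, a deliberate simplification), and the injectable clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative only: trips on consecutive failures, a simplification of real circuit breakers.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping
    private final long sleepWindowMillis; // how long to stay OPEN before a trial call
    private final LongSupplier clock;     // injectable so the sketch can be tested without sleeping
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized State getState() {
        // OPEN lazily becomes HALF_OPEN once the sleep window has elapsed.
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Holding the lock during the call serializes callers; acceptable for a sketch only.
    synchronized String call(Supplier<String> protectedCall, String fallback) {
        if (getState() == State.OPEN) {
            return fallback; // fail fast, the protected service is not called
        }
        try {
            String result = protectedCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED; // a successful trial call closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN; // trip, or re-open after a failed trial call
                openedAt = clock.getAsLong();
            }
            return fallback;
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 5000, System::currentTimeMillis);
        for (int i = 0; i < 5; i++) {
            String response = breaker.call(
                    () -> { throw new RuntimeException("ProductService is down"); },
                    "Fallback response");
            System.out.println(response + " (state: " + breaker.getState() + ")");
        }
    }
}
```

With a threshold of 3, the first three failing calls reach the protected code and trip the breaker; the remaining calls return the fallback without calling it at all.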
How to configure a circuit breaker in Hystrix?
private static final Setter PRODUCT_SERVICE_HYSTRIC_CONFIG = Setter
        .withGroupKey(HystrixCommandGroupKey.Factory.asKey("MyProductService"))
        .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                .withExecutionTimeoutInMilliseconds(1000)
                .withCircuitBreakerEnabled(true)
                .withCircuitBreakerRequestVolumeThreshold(10)
                .withCircuitBreakerErrorThresholdPercentage(50)
                .withCircuitBreakerSleepWindowInMilliseconds(5000)
                .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE));
Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("MyProductService"))
This line creates a HystrixCommandGroupKey object with the name "MyProductService". A command group key is used to group related Hystrix commands together, for example for reporting, alerting, and dashboards. By default, Hystrix also uses the group key to name the thread pool, so commands in the same group share a thread pool unless a separate thread pool key is configured.
For example, if you have multiple commands related to the product service, you can put them under the "MyProductService" group so that they are monitored together and share a thread pool. Note that the circuit breaker itself is tracked per command key (by default, the command class name), and the properties passed to andCommandPropertiesDefaults apply to the commands that are built with this Setter.
withExecutionTimeoutInMilliseconds(1000)
Sets the timeout for each execution of the command to 1000 milliseconds (1 second). If the command execution takes longer than this timeout, it will be considered a failure.
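The idea behind an execution timeout can be sketched without Hystrix. The following plain-Java example is only an illustration under stated assumptions: the class TimeoutWithFallback and its method callWithTimeout are hypothetical names, and the task runs on a separate thread (closer to thread isolation than to the SEMAPHORE strategy used in this article).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not a Hystrix API.
class TimeoutWithFallback {

    static String callWithTimeout(Callable<String> task, long timeoutMillis, String fallback) {
        // One executor per call keeps the sketch short; real code would reuse a pool.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // the call took too long: treat it as a failure
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the call itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // The task needs 1100 ms but only 1000 ms are allowed, so the fallback is returned.
        String result = callWithTimeout(() -> {
            Thread.sleep(1100);
            return "MyProductService finished";
        }, 1000, "Fallback response");
        System.out.println(result);
    }
}
```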
withCircuitBreakerEnabled(true)
Enables the circuit breaker for the command. The circuit breaker is a mechanism that prevents the command from executing if it detects that the system is in a degraded state.
withCircuitBreakerRequestVolumeThreshold(10)
Sets the minimum number of requests within a rolling window that must occur before the circuit breaker can consider tripping. In this case, the circuit breaker will only trip if there are at least 10 requests within the window.
In the context of a circuit breaker, "tripping" refers to the state transition of the circuit breaker from closed to open. When a circuit breaker "trips open," it means that it has detected a problem with the external system or service it is protecting, and it prevents further requests from being sent to that system.
In Hystrix, the circuit breaker has a feature called a "sleep window," which specifies the duration for which the circuit remains open after it has been tripped. During this sleep window, the circuit breaker does not allow any requests to pass through, giving the failing service time to recover.
After the sleep window expires, the circuit breaker enters a "half-open" state. In this state, the circuit breaker allows a small number of requests to pass through to the protected system to determine if it has recovered. If these requests succeed without errors, the circuit breaker transitions back to the "closed" state, indicating that the protected system is healthy again.
withCircuitBreakerErrorThresholdPercentage(50)
Sets the percentage of requests that must fail within the rolling window for the circuit breaker to trip open. In this case, if 50% or more of the requests fail, the circuit breaker will open.
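The request volume threshold and the error threshold percentage work together. As a rough sketch of the decision described above (the class TripDecision and the method shouldTrip are hypothetical names, not Hystrix code), the check for one rolling window could look like this:

```java
// Illustrative only: mirrors the two thresholds described above for one rolling window.
class TripDecision {

    static boolean shouldTrip(int totalRequests, int failedRequests,
            int requestVolumeThreshold, int errorThresholdPercentage) {
        // Too few requests in the window: never trip, whatever the failure rate is.
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;
        }
        long errorPercentage = failedRequests * 100L / totalRequests;
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        // Thresholds from the configuration in this article: 10 requests, 50% errors.
        System.out.println(shouldTrip(9, 9, 10, 50));  // false: only 9 requests in the window
        System.out.println(shouldTrip(10, 5, 10, 50)); // true: 10 requests, 50% failed
        System.out.println(shouldTrip(10, 4, 10, 50)); // false: 40% is below the threshold
    }
}
```

This shows why nine failures in a row do not open the circuit with these settings, while ten requests with five failures do.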
withCircuitBreakerSleepWindowInMilliseconds(5000)
Sets the time that the circuit breaker will sleep before allowing another request through to see if the circuit is healthy again. In this case, the circuit breaker will sleep for 5000 milliseconds (5 seconds) before allowing another request.
withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE)
Sets the isolation strategy for the command to SEMAPHORE. This means that the command will be executed within the same thread as the caller rather than in a separate thread pool.
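The idea of semaphore isolation can be sketched with java.util.concurrent.Semaphore. This is an assumption-laden illustration, not the Hystrix implementation: the class SemaphoreIsolation and the maxConcurrentCalls parameter are hypothetical names. The protected call runs on the caller's thread, and a call is rejected with the fallback when no permit is free.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper, not a Hystrix API.
class SemaphoreIsolation {
    private final Semaphore permits;

    SemaphoreIsolation(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    String call(Supplier<String> protectedCall, String fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // all permits are in use: reject immediately instead of queuing
        }
        try {
            return protectedCall.get(); // runs on the caller's own thread
        } catch (RuntimeException e) {
            return fallback;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreIsolation isolation = new SemaphoreIsolation(1);
        String caller = Thread.currentThread().getName();
        // The protected call sees the same thread name as the caller.
        System.out.println(isolation.call(
                () -> "ran on " + Thread.currentThread().getName() + ", caller is " + caller,
                "Fallback response"));
    }
}
```

Because no extra thread is involved, this approach has little overhead, but a slow call keeps the caller's thread busy until it returns.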
The complete working application is given below.
MyProductService.java
package com.sample.app.services;

import java.util.Random;

public class MyProductService {

    private int counter = 1;

    public String callService() {
        System.out.println("MyProductService called for " + counter++ + " time");

        // Alternative with a random delay (Random.nextInt(int, int) needs Java 17 or later):
        // int randomTime = 800 + new Random().nextInt(100, 500);

        // Always sleep longer than the 1000 ms Hystrix timeout so that every call times out
        int randomTime = 1100;
        TimeUtil.sleep(randomTime);
        return "MyProductService finished";
    }
}
MyOrderService.java
package com.sample.app.services;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommand.Setter;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class MyOrderService {

    private static final Setter PRODUCT_SERVICE_HYSTRIC_CONFIG = Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("MyProductService"))
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    .withExecutionTimeoutInMilliseconds(1000)
                    .withCircuitBreakerEnabled(true)
                    .withCircuitBreakerRequestVolumeThreshold(10)
                    .withCircuitBreakerErrorThresholdPercentage(50)
                    .withCircuitBreakerSleepWindowInMilliseconds(5000)
                    .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE));

    private MyProductService productService = new MyProductService();

    public String getProductDetails() {
        return new ProductServiceCommand(productService).execute();
    }

    private class ProductServiceCommand extends HystrixCommand<String> {

        private final MyProductService prodService;

        protected ProductServiceCommand(MyProductService prodService) {
            super(PRODUCT_SERVICE_HYSTRIC_CONFIG);
            this.prodService = prodService;
        }

        @Override
        protected String run() throws Exception {
            return prodService.callService();
        }

        @Override
        protected String getFallback() {
            return "Fallback response\n";
        }
    }
}
TimeUtil.java
package com.sample.app.services;

import java.util.concurrent.TimeUnit;

public class TimeUtil {

    public static void sleep(int noOfMilliseconds) {
        try {
            TimeUnit.MILLISECONDS.sleep(noOfMilliseconds);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so that callers can still detect the interruption
            Thread.currentThread().interrupt();
        }
    }
}
HystrixDemo.java
package com.sample.app;

import com.sample.app.services.MyOrderService;
import com.sample.app.services.TimeUtil;

public class HystrixDemo {

    public static void main(String[] args) {
        MyOrderService orderService = new MyOrderService();

        for (int i = 0; i < 20; i++) {
            System.out.println(orderService.getProductDetails());

            // Slow down the later iterations so that the 5-second sleep window elapses during the run
            if (i > 10) {
                TimeUtil.sleep(1000);
            }
        }
    }
}
Output
MyProductService called for 1 time
Fallback response
MyProductService called for 2 time
Fallback response
MyProductService called for 3 time
Fallback response
MyProductService called for 4 time
Fallback response
MyProductService called for 5 time
Fallback response
MyProductService called for 6 time
Fallback response
MyProductService called for 7 time
Fallback response
MyProductService called for 8 time
Fallback response
MyProductService called for 9 time
Fallback response
MyProductService called for 10 time
Fallback response
Fallback response
Fallback response
Fallback response
Fallback response
Fallback response
Fallback response
MyProductService called for 11 time
Fallback response
Fallback response
Fallback response
Based on the output, the circuit breaker opens after the first 10 requests, all of which time out: the request volume threshold of 10 is reached, and the error percentage (100%) is above the configured 50%. While the circuit is open, the OrderService does not call the ProductService and returns the fallback response immediately. After the sleep window of the circuit breaker (set to 5000 milliseconds), Hystrix lets one request through to the ProductService, which is call number 11 in the output. Because that request also times out, the circuit breaker goes back to the open state.
You can download this application from this link.