Sunday 9 June 2024

Understanding Dual Write and Effective Strategies to Prevent the Dual Write Problem

The Dual Write problem occurs when an application has to update data in two different systems (for example, a database and a message queue) as part of a single logical operation, but cannot wrap both writes in one atomic transaction. If one write succeeds while the other fails, the two systems drift apart, which makes dual writes one of the most common causes of data inconsistencies.

 

Let me explain this with an example. Imagine you are working on an e-commerce platform where customers maintain a balance in their wallets. When a customer buys an item, the Purchase Microservice updates the user's balance in a database (e.g., deducting money for the purchase) and also sends a message to a queue (e.g., to initiate the delivery of the item to the customer) simultaneously.
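In code, this naive dual write looks roughly like the sketch below (a Python sketch using hypothetical db and queue client objects, shown only for illustration). The two writes are independent operations, so either one can fail while the other succeeds.

def complete_purchase(db, queue, user_id, amount, item_id):
    # db and queue are hypothetical clients used only to illustrate the problem.
    # Write 1: deduct the purchase amount from the user's wallet balance.
    db.execute(
        "UPDATE wallets SET balance = balance - %s WHERE user_id = %s",
        (amount, user_id),
    )
    db.commit()

    # Write 2: ask the delivery service to ship the item.
    # If this call fails (broker down, network error), the balance is already
    # deducted and the database and the queue are now inconsistent.
    queue.publish("deliveries", {"user_id": user_id, "item_id": item_id})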

 

Let's explore the various possible combinations in this scenario.

1. Successful Update of Wallet Balance and Message Queue

 

·      The Purchase Microservice successfully updates the user's balance in the database.

·      The message to the queue is successfully sent.

·      Outcome: Both operations succeed, and data remains consistent.


 

2. Successful Update of Wallet Balance, Failed Message Queue

·      The Purchase Microservice successfully updates the user's balance in the database.

·      The message to the queue fails to send.

·      Outcome: The user's balance is updated, but the item delivery is not initiated, leading to an inconsistency.


 

3. Failed Update of Wallet Balance, Successful Message Queue

 

·      The Purchase Microservice fails to update the user's balance in the database.

·      The message to the queue is successfully sent.

·      Outcome: The item delivery is initiated, but the user's balance is not updated, leading to an inconsistency.

 


4. Failed Update of Wallet Balance and Failed Message Queue

 

·      The Purchase Microservice fails to update the user's balance in the database.

·      The message to the queue fails to send.

·      Outcome: Both operations fail, and no changes are made, maintaining consistency but with no progress.



How to Address the Dual Write Problem?

The following techniques are used to address the Dual Write problem.

 

1.   Transactional Outbox Pattern

2.   Change Data Capture (CDC)

 

1. Transactional Outbox Pattern

The Transactional Outbox Pattern is a design pattern that guarantees the primary data update and the recording of the corresponding event take place within a single database transaction. It achieves this by utilizing an Outbox Table in the same database to capture events destined for other systems. The application writes the event data to this outbox table alongside the primary data. A separate background process then periodically scans the outbox table and dispatches the events to the designated event channel (such as Kafka or RabbitMQ). Once published, these events are consumed by the systems or services subscribed to the event channel.

 


Let's see how it works with the product purchase example. When a customer makes a purchase in an e-commerce platform, the Purchase Microservice updates the user's balance and publishes a PurchaseCompleted event to be consumed by the delivery service.

 

Step 1: Customer Makes a Purchase

The customer initiates a purchase through the Purchase Microservice.

 

Step 2: Update Balance and Insert Event into Outbox

The Purchase Microservice updates the user's balance in the database.

Simultaneously, it inserts a PurchaseCompleted event into an outbox table within the same database.

 

Step 3: Store Outbox Event

The event data is stored in the outbox table, ensuring that both the data update and event insertion occur within the same transaction.
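Here is a minimal sketch of Steps 2 and 3, assuming a DB-API style database connection (such as psycopg2) and hypothetical wallets and outbox tables. Because both statements run inside one transaction, they commit or roll back together.

import json
import uuid

def purchase(conn, user_id, amount, item_id):
    # Steps 2-3: the balance update and the outbox insert share one transaction.
    # Table and column names (wallets, outbox) are illustrative assumptions.
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.cursor()
        cur.execute(
            "UPDATE wallets SET balance = balance - %s WHERE user_id = %s",
            (amount, user_id),
        )
        cur.execute(
            "INSERT INTO outbox (id, event_type, payload, published) "
            "VALUES (%s, %s, %s, FALSE)",
            (str(uuid.uuid4()), "PurchaseCompleted",
             json.dumps({"user_id": user_id, "item_id": item_id})),
        )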

 

Step 4: Background Worker Publishes Event

A background worker process periodically reads the outbox table.

It processes the events and publishes them to the event channel (e.g., Kafka, RabbitMQ) for consumption by other services.
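A sketch of the Step 4 background worker under the same assumptions, where publish_to_channel is a hypothetical stand-in for a Kafka or RabbitMQ producer call:

import time

def relay_outbox(conn, publish_to_channel, poll_interval=1.0, batch_size=100):
    # Periodically move unpublished outbox rows to the event channel.
    while True:
        cur = conn.cursor()
        cur.execute(
            "SELECT id, event_type, payload FROM outbox "
            "WHERE published = FALSE LIMIT %s",
            (batch_size,),
        )
        for event_id, event_type, payload in cur.fetchall():
            # Publish first, then mark the row. If the worker crashes between
            # these two steps, the event is re-published on retry, which is
            # why consumers must handle duplicates (see below).
            publish_to_channel(event_type, payload)
            cur.execute(
                "UPDATE outbox SET published = TRUE WHERE id = %s",
                (event_id,),
            )
            conn.commit()
        time.sleep(poll_interval)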

 


Pros of Transactional Outbox Pattern

a.   Data Consistency: Ensures that both the balance update and the event insertion into the outbox table occur atomically within the same transaction. This atomicity guarantees that either both operations succeed, or neither does, maintaining data consistency between the microservice and the outbox table.

 

b.   Resilience and Reliability: The outbox table acts as a reliable store for events until they are successfully processed and published by the background worker. This reduces the risk of losing events in case of failures or network issues.

c.    Simplifies Microservice Architecture: The Purchase Microservice focuses solely on the business logic (e.g., updating balance), without needing to handle event publishing complexities. The separation of concerns makes the microservice easier to develop, test, and maintain.

d.   Minimizes Direct Dependencies: The Purchase Microservice does not need direct knowledge of the event channel or delivery service. This decoupling allows each component to evolve independently.

 

Cons of Transactional Outbox Pattern

a.   Complexity in Implementation: Setting up the transactional outbox pattern requires careful configuration of database transactions and background processes. Ensuring that the background process correctly reads, publishes, and marks events as processed adds to the complexity.

b.   Resource Overhead: The background process consumes additional resources (CPU, memory, and I/O) to monitor the outbox table and publish events. This can impact system performance, especially under high load conditions.

c.    Latency: There might be a delay between when an event is written to the outbox table and when it is published to the event channel. This latency could impact real-time requirements, such as immediate notification or action based on the event.

d.   Duplicate Events and Idempotency: If the background process fails after publishing an event but before marking it as processed, the event might be published again. Consumers of the event must be designed to handle potential duplicates by ensuring idempotent processing.

 

Problem with Transactional Outbox Pattern

When the background process reads from the outbox table and publishes an event to the event channel, there is a chance that it might fail to update the status in the outbox table. This failure can result in the event being published again when the background process retries, leading to duplicate events. Consumers therefore need to handle potential duplicates. This is typically achieved by using unique identifiers for events and ensuring idempotent processing, or by ignoring an event if it is received multiple times.

 

2. Change Data Capture (CDC)

Change Data Capture (CDC) is a software design pattern used to identify and track changes in a database. These changes are captured (using the transaction log, record timestamps, triggers, etc.) and used to replicate or synchronize data across different systems in real time or near real time. CDC ensures data consistency across multiple systems.

 

How Does CDC Work?

Scenario: Master Data Management System and HR System Integration

 

In this scenario, the Master Data Management (MDM) system serves as the centralized repository for all employee details, project information, and other company entities. Additionally, there is an HR System that needs to replicate employee information from the MDM system.

 

Challenge

The challenge arises when updates occur in the MDM system. These updates need to be synchronized with the HR System in real-time to ensure data consistency across both systems. However, directly writing data to both systems simultaneously can lead to the dual write problem, where inconsistencies may arise if one write operation succeeds while the other fails.

 

How Change Data Capture Addresses the Dual Write Problem

Step 1: Change Occurs in MDM System

A change, such as an update to an employee's details, occurs in the MDM system.

 

Step 2: Capture Changes with CDC

Change Data Capture mechanisms in the MDM system capture these changes. This could be done through techniques like monitoring transaction logs, using database triggers, or querying the employee table by updated timestamp.

 

Step 3: Generate Events

The CDC system generates events for the changes detected. These events represent the updates made to employee details.

 

Step 4: Publish Events

The generated events are published to an event channel, such as Kafka or RabbitMQ. This ensures that the events are reliably propagated to interested consumers, including the HR System.
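Production CDC is usually done with a log-based tool such as Debezium, but the timestamp-polling technique mentioned in Step 2 can be sketched as follows (assuming a hypothetical employees table with an updated_at column, and publish_event standing in for the event-channel producer). The sketch covers Steps 2 through 4.

import time
from datetime import datetime

def poll_employee_changes(conn, publish_event, poll_interval=5.0):
    # Re-query rows modified since the last poll and emit one event per row.
    last_seen = datetime(1970, 1, 1)
    while True:
        cur = conn.cursor()
        cur.execute(
            "SELECT employee_id, name, department, updated_at "
            "FROM employees WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for employee_id, name, department, updated_at in cur.fetchall():
            # Steps 3-4: turn the detected change into an event and publish it.
            publish_event("EmployeeUpdated", {
                "employee_id": employee_id,
                "name": name,
                "department": department,
            })
            last_seen = updated_at  # rows are ordered, so this ends at the max
        time.sleep(poll_interval)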

 

Step 5: Consume Events in HR System

The HR System subscribes to the event channel and consumes the published events. Upon receiving an event, the HR System updates its own database with the changes reflected in the MDM system.
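And a sketch of the Step 5 consumer on the HR side, assuming a PostgreSQL-style upsert into a hypothetical hr_employees table. The upsert keeps the handler safe to replay.

def apply_employee_event(hr_conn, event):
    # Apply an EmployeeUpdated event to the HR System's own database.
    with hr_conn:
        cur = hr_conn.cursor()
        # The upsert keeps the handler idempotent: replaying the same event
        # leaves the row in the same state.
        cur.execute(
            "INSERT INTO hr_employees (employee_id, name, department) "
            "VALUES (%s, %s, %s) "
            "ON CONFLICT (employee_id) DO UPDATE SET "
            "name = EXCLUDED.name, department = EXCLUDED.department",
            (event["employee_id"], event["name"], event["department"]),
        )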



 

Pros in the Scenario

a.   Data Consistency: CDC ensures that employee details and project information remain consistent between the Master Data Management (MDM) system and the HR system.

b.   Minimal Impact: By leveraging CDC mechanisms such as transaction logs, the performance impact on the MDM system is minimized during the capture and propagation of changes to the HR system.

c.    Decoupling: The MDM system can focus solely on maintaining accurate data, while CDC handles the task of propagating changes to the HR system, resulting in a more modular and maintainable architecture.

 

Cons in the Scenario

a.   Complex Setup: Configuring CDC for the integration between the MDM system and the HR system requires careful setup and tuning of the CDC tool and database configurations.

b.   Resource Overhead: The CDC tool consumes additional resources, such as processing power and storage, to monitor changes and manage event propagation, potentially increasing the overall resource usage of the system.

c.    Latency: There may be a slight delay in propagating updates from the MDM system to the HR system via CDC, which could impact the real-time nature of certain HR operations and data access.

 

Problem to Consider in CDC

If CDC captures a change and publishes an event but fails to update the status indicating that the event has been published, it might attempt to publish the event again. This can lead to duplicate events being published. Consumers of the events need to be designed to handle potential duplicates. This can be done by ensuring idempotency in the consumer logic, where processing the same event multiple times does not result in an inconsistent state, or by ignoring an event if it is received multiple times.

 

Both CDC and the Transactional Outbox Pattern require careful handling of potential duplicates to ensure data consistency. While they significantly reduce the risk of dual writes, they do not completely eliminate the possibility of duplicate events. Hence, designing systems to be idempotent and to handle duplicates gracefully is crucial. We can address this duplicate event problem by using unique event identifiers, or by ensuring that processing an event multiple times does not change the state beyond the first processing.

 

Example pseudocode for ignoring duplicate events

def process_event(event):
    # Skip events whose unique id has already been handled (e.g., recorded
    # in a processed-events table keyed by event id).
    if event_already_processed(event.id):
        return
    update_system_state(event)          # apply the business change
    mark_event_as_processed(event.id)   # remember the id so retries are ignored
