Programming for beginners: Real-time Data Processing: Traditional RDBMS vs Modern Systems

Conventional RDBMS systems are designed for OLTP (Online Transaction Processing) workloads, emphasizing strong transaction consistency and durability. In simple terms, these systems are fine-tuned for efficient storage and retrieval of data but are less suited for processing data in real-time as it is generated.

Interacting with RDBMS involves a three-step procedure.

Step 1: Insert data into a database.

Step 2: After indexing the data and completing the transaction, the data becomes accessible for subsequent query processing.

Step 3: The data is now prepared for querying, user can read the data

This "process-after-store" model forms the core principle of all traditional DBMSs.

In real-time applications, the storage operation, which must occur before processing, adds a significant delay/latency in the application, which is not acceptable for real time processing engines.

Here are some reasons why conventional RDBMS systems are not well-suited for real-time processing:

Latency: RDBMS systems often exhibit high latency, resulting in a significant delay for data to be written to disk and then retrieved. This is impractical for real-time processing, where data must be handled promptly upon generation.

Scalability: Traditional RDBMS systems lack the scalability needed for efficient real-time processing. Without proper scaling, these systems can struggle to manage a rapid data inflow.

What are the popular real time data or stream processing engines?

Apache Kafka Streams, Apache Flink, Apache Storm, Apache Spark streaming, Redis streams, Amazon Kinesis, Google Cloud Dataflow, Pulsar etc.,

Is stream processing engines do in-memory processing?

Yes, stream processing engines typically do in-memory processing. This means that they process data as it arrives in memory, rather than writing it to disk first. This methodology substantially decreases latency and enhances performance. This approach enables faster data processing and analytics on the fly, making stream processing engines well-suited for applications like real-time analytics, monitoring, and event-driven architectures.

Is stream engines need to persist the data?

Stream processing engines typically do not inherently persist data. The primary focus of stream processing is to process the data in real-time as it is generated, rather than storing it for subsequent analysis. Nevertheless, there are situations where persisting data from a stream becomes necessary.

For instance, in certain applications, it is crucial to maintain a record of processed events for compliance, historical analysis, or system recovery from failures. Stream processing frameworks often provide connectors or integrations with databases, data lakes, or other storage systems, enabling users to store pertinent information.

Several methods exist for persisting data from a stream:

Database Storage: Writing data to a database is a common method, although it can be slow and inefficient, especially with large datasets.

Distributed Filesystem: Storing data in a distributed filesystem offers a more scalable solution for persisting stream data. Distributed filesystems are designed to handle significant data volumes and are accessible by multiple systems.

Stream Processing Engines with Persistence Support: Some stream processing engines, such as Apache Flink and Apache Spark Streaming, inherently support persistence. This simplifies the process of persisting data from a stream without requiring extensive infrastructure considerations.

It's essential to note that persisting data from a stream introduces trade-offs. If low latency is a priority, avoiding data persistence may be preferable. However, if data durability is paramount, then persistence becomes a necessary aspect of the streaming architecture.

Can I use asynchronous persistence to not impact performance?

Absolutely, employing asynchronous persistence is a common approach to alleviate the impact on real-time processing performance. With asynchronous persistence, the stream processing engine can continue handling incoming data without waiting for the complete writing of data to persistent storage.

Rather than synchronously writing data to persistent, the stream processing engine can buffer or batch the data. One the buffer is full, the actual write operation is then delegated to a background process. This approach enables the engine to seamlessly progress with processing new data, independent of the completion of the storage operation.

By employing asynchronous persistence, you can achieve a balance between real-time processing performance and data persistence.

Following skelton application demonstrate the application, where it keep on receiving the real time CPU metrics, and update the min and max cpu consumption, and persist the same.

SystemMetrics.java

package com.sample.app;

public class SystemMetrics {
	private Double minCPUConsumption = Double.MAX_VALUE;
	private Double maxCPUConsumption = Double.MIN_VALUE;
	
	public SystemMetrics() {
		
	}

	public SystemMetrics(Double minCPUConsumption, Double maxCPUConsumption) {
		this.minCPUConsumption = minCPUConsumption;
		this.maxCPUConsumption = maxCPUConsumption;
	}

	public Double getMinCPUConsumption() {
		return minCPUConsumption;
	}

	public void setMinCPUConsumption(Double minCPUConsumption) {
		this.minCPUConsumption = minCPUConsumption;
	}

	public Double getMaxCPUConsumption() {
		return maxCPUConsumption;
	}

	public void setMaxCPUConsumption(Double maxCPUConsumption) {
		this.maxCPUConsumption = maxCPUConsumption;
	}

	@Override
	public String toString() {
		return "SystemMetrics [minCPUConsumption=" + minCPUConsumption + ", maxCPUConsumption=" + maxCPUConsumption
				+ "]";
	}

}

StreamProcessingAndPersist.java

package com.sample.app;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class StreamProcessingAndPersist {

	public static void main(String[] args) throws InterruptedException {
		final SystemMetrics systemMetrics = new SystemMetrics();
		final BlockingQueue<Double> blockingQueue = new LinkedBlockingDeque<>();

		// Stream that generate the CPU consumption details
		Stream<Double> stream = Stream.generate(() -> {
			return readCPUUSage();

		});

		Thread t1 = new Thread() {
			public void run() {

				stream.forEach(cpuConsumption -> {
					// Process the stream of data
					if (systemMetrics.getMaxCPUConsumption() < cpuConsumption) {
						systemMetrics.setMaxCPUConsumption(cpuConsumption);
					}

					if (systemMetrics.getMinCPUConsumption() > cpuConsumption) {
						systemMetrics.setMinCPUConsumption(cpuConsumption);
					}

					System.out.println("systemMetrics : " + systemMetrics);
					// Persist the data asynchronously
					blockingQueue.add(cpuConsumption);
				});
			}
		};

		Thread t2 = new Thread() {
			public void run() {

				Double cpuConsumption;
				try {
					while ((cpuConsumption = blockingQueue.take()) != null) {
						System.out.println("cpuConsumption : " + cpuConsumption);
					}
				} catch (InterruptedException e) {
					// TODO Auto-generated catch block
					e.printStackTrace();
				}
			}
		};

		t2.start();
		t1.start();

		TimeUnit.SECONDS.sleep(1);
		System.exit(0);

	}

	private static Double readCPUUSage() {
		return Math.random() * 100;
	}
}

System Design Questions

Programming for beginners

Tuesday, 5 December 2023

Real-time Data Processing: Traditional RDBMS vs Modern Systems

No comments:

Post a Comment