Programming for beginners: Apache Pinot Architecture

Apache Pinot is a free, distributed system built for analyzing data in real time. It is great for handling large amounts of data and running super-fast queries, making it useful for real-time reporting and data storage.

1. Key Features of Apache Pinot:

· Super-fast data analysis, even when processing a huge number of requests.

· Stores data in columns and uses smart indexing and pre-aggregation to speed up queries.

· Can grow as needed, either by adding more resources to existing servers or by increasing the number of servers.

· Delivers stable performance based on cluster size and the number of queries it needs to handle per second.
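
To make the "stores data in columns and uses smart indexing" point more concrete, here is a small Python sketch. It is only a toy model and not Pinot's real storage format: the sample `rows`, the `build_inverted_index` helper, and the `sum_clicks_for` function are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; Pinot's real segment format is far more compact.
rows = [
    {"country": "US", "clicks": 3},
    {"country": "IN", "clicks": 5},
    {"country": "US", "clicks": 2},
]

# Column-oriented layout: one list per column instead of one dict per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def build_inverted_index(values):
    """Map each distinct column value to the row ids that contain it."""
    index = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
    return dict(index)


country_index = build_inverted_index(columns["country"])


def sum_clicks_for(country):
    """Sum clicks for one country, reading only the matching row ids."""
    matching_ids = country_index.get(country, [])
    return sum(columns["clicks"][i] for i in matching_ids)


print(sum_clicks_for("US"))  # 5
```

The filter touches only the index and the one column it needs, which is the basic reason columnar layouts with indexes answer analytical queries quickly.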

2. Key Components of Apache Pinot

Apache Pinot is a distributed system composed of four main components: Controller, Broker, Server, and Minion. Each plays a crucial role in ensuring high-performance, fault-tolerant, and scalable real-time analytics.

3. Controller

The Pinot Controller is the central management component that takes care of system operations, metadata, and coordination among the other components. It performs several key functions:

3.1 Metadata Management with ZooKeeper: The controller maintains global metadata such as configurations, schemas, and cluster state information. It uses ZooKeeper, which acts as the persistent metadata store, ensuring cluster consistency and fault tolerance.

3.2 Managing Other Pinot Components: The Helix Controller (hosted within the Pinot Controller) is responsible for managing brokers, servers, and minions. It ensures proper load balancing, failover handling, and state transitions within the cluster.

3.3 Segment Management and Query Routing: Pinot data is divided into segments, which are distributed across multiple servers. The controller keeps track of which servers store which segments and updates this mapping as needed. Brokers use this mapping to route queries efficiently to the appropriate servers.
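
The segment-to-server mapping described here can be pictured as a simple dictionary. The sketch below is a simplified model, not Pinot code: the segment names, server names, and both helper functions are hypothetical.

```python
# Hypothetical segment and server names, for illustration only.
segment_to_servers = {
    "sales_2025_06_30": ["server-1", "server-2"],
    "sales_2025_07_01": ["server-2", "server-3"],
}


def plan_routing(segments, mapping):
    """Pick one replica per segment and group the segments by chosen server."""
    plan = {}
    for segment in segments:
        replicas = mapping[segment]
        # Naive choice: prefer the replica with the fewest segments so far.
        chosen = min(replicas, key=lambda s: len(plan.get(s, [])))
        plan.setdefault(chosen, []).append(segment)
    return plan


def move_segment(mapping, segment, old_server, new_server):
    """Update the mapping when one replica of a segment is relocated."""
    mapping[segment] = [
        new_server if server == old_server else server
        for server in mapping[segment]
    ]


plan = plan_routing(list(segment_to_servers), segment_to_servers)
print(plan)
```

Each segment ends up in exactly one server's list, so no segment is queried twice even though it has two replicas.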

3.4 Admin and Configuration Management: The controller provides REST endpoints for viewing, creating, updating, and deleting cluster configurations. These APIs allow administrators to manage the Pinot cluster dynamically without restarting services.
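
As a rough illustration of talking to these REST endpoints, the Python sketch below only builds the HTTP requests and does not send them. The host and port (`localhost:9000`), the `/tables` and `/schemas` paths, and the `sales` schema are assumptions for this example; check them against the API listing exposed by your own controller before relying on them.

```python
import json
import urllib.request

# Assumption: a controller is reachable here (9000 is a commonly used port).
CONTROLLER_URL = "http://localhost:9000"


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request to the controller."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        CONTROLLER_URL + path, data=data, headers=headers, method=method
    )


# View: list the tables known to the cluster (assumed path).
list_tables = build_request("GET", "/tables")

# Create: register a hypothetical schema named "sales" (assumed path and body).
example_schema = {
    "schemaName": "sales",
    "dimensionFieldSpecs": [{"name": "country", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
}
add_schema = build_request("POST", "/schemas", example_schema)

print(list_tables.get_method(), list_tables.full_url)
print(add_schema.get_method(), add_schema.full_url)
```

With a running cluster, passing one of these objects to `urllib.request.urlopen(...)` would send the request.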

3.5 Handling Data Ingestion & Segment Uploads

· For offline data ingestion, the controller provides endpoints for uploading pre-built data segments.

· For real-time ingestion, the controller helps to manage Kafka stream consumption and coordinates the persistence of real-time segments into permanent storage.

3.6 Cluster Maintenance & Retention Policies: The controller ensures that old or outdated segments are removed or archived according to retention policies. It validates segment integrity and enforces consistency across different Pinot components.
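
A retention check can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each segment is represented only by the timestamp of its newest row, and the `expired_segments` function and the segment names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expired_segments(segments, retention_days, now):
    """Return names of segments whose newest row is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [name for name, end_time in segments.items() if end_time < cutoff]


# Hypothetical segments: name -> timestamp of the newest row in the segment.
segments = {
    "sales_old": datetime(2025, 5, 1, tzinfo=timezone.utc),
    "sales_recent": datetime(2025, 6, 30, tzinfo=timezone.utc),
}

now = datetime(2025, 7, 2, tzinfo=timezone.utc)
print(expired_segments(segments, retention_days=30, now=now))  # ['sales_old']
```

Passing `now` in as a parameter keeps the check easy to test with fixed dates.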

How does the Controller avoid a single point of failure?

To avoid single points of failure, multiple Pinot Controller instances can be deployed. All controllers must be configured with the same back-end storage system (e.g., NFS, HDFS, or ADLS) to ensure a consistent view of segment locations and metadata. This setup allows failover mechanisms where, if one controller goes down, another can take over seamlessly.

In summary, Pinot Controller is the brain of the system, responsible for managing metadata, coordinating Pinot components, handling data ingestion, and enforcing retention policies.

4. Broker

Apache Pinot’s broker component plays an important role in optimizing query execution, ensuring efficient data retrieval, and supporting data-driven applications. Brokers act as intermediaries between clients and Pinot servers: they accept queries, distribute them to the right servers, and merge the results before returning them to clients.

4.1 How Do Brokers Work in Apache Pinot?

Query Handling and Routing: Pinot brokers accept queries from clients and forward them to the appropriate servers that store the relevant data segments. Brokers collect the results from multiple servers, consolidate them into a single response, and send them back to the client.
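
This "send to many servers, then combine" pattern is often called scatter-gather. Here is a minimal Python sketch of the idea. The `SERVER_DATA` contents and the `server_execute` and `broker_query` functions are stand-ins invented for this example, not Pinot APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data held by each server: a list of (country, amount) rows.
SERVER_DATA = {
    "server-1": [("US", 10), ("IN", 7)],
    "server-2": [("US", 5)],
    "server-3": [("IN", 1), ("US", 2)],
}


def server_execute(server, country):
    """Stand-in for a server running the query on its local segments."""
    matching = [amount for c, amount in SERVER_DATA[server] if c == country]
    return {"count": len(matching), "sum": sum(matching)}


def broker_query(country):
    """Scatter the query to every server, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: server_execute(s, country), SERVER_DATA))
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
    }


print(broker_query("US"))  # {'count': 3, 'sum': 17}
```

Counts and sums merge cleanly because a total is just the sum of the partial results; an average would be computed from the merged `sum` and `count`, not by averaging the per-server averages.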

Helix Spectators and Segment Location Awareness: Pinot brokers function as Helix spectators, meaning they observe the cluster state but do not directly store data. They track the location of all table segments and their replicas to route queries efficiently to the correct Pinot servers. Apache Helix helps brokers determine which participants (servers) contain specific partitions (segments) of a dataset.

Query Consistency and Optimization: Brokers ensure that every row in a queried table is scanned exactly once, guaranteeing accurate and consistent results. They apply query pruning techniques to skip unnecessary segments without affecting accuracy.
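
One common pruning idea is to skip segments whose time range cannot overlap the query's time filter. The Python sketch below shows that idea only; the `SegmentMeta` class, the day numbers, and the `prune` function are assumptions for illustration, and Pinot's actual pruners may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMeta:
    name: str
    min_day: int  # earliest day stored in the segment
    max_day: int  # latest day stored in the segment


def prune(segments, query_start, query_end):
    """Keep only segments whose [min_day, max_day] overlaps the query range."""
    return [
        s for s in segments
        if s.max_day >= query_start and s.min_day <= query_end
    ]


# Hypothetical segments covering three consecutive ten-day windows.
segments = [
    SegmentMeta("seg_a", 1, 10),
    SegmentMeta("seg_b", 11, 20),
    SegmentMeta("seg_c", 21, 30),
]

kept = prune(segments, query_start=12, query_end=22)
print([s.name for s in kept])  # ['seg_b', 'seg_c']
```

Skipping `seg_a` is safe because none of its rows could match the filter, so the result stays exactly the same.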

5. Servers

Pinot servers are responsible for storing data segments and executing queries on the data they host. They ensure efficient data retrieval and real-time analytics.

There are two kinds of Servers:

· Offline Servers

· Real-Time Servers

5.1 Offline Servers (Batch Data Processing)

Offline servers handle pre-loaded, batch-processed data.

How does an Offline Server work?

· New data segments are uploaded to the segment store (a central storage system).

· The Pinot controller decides which servers should store these segments (based on replication settings).

· The selected servers download the segments, load them, and serve queries from this stored data.

For example, if new sales data is added daily, offline servers download each new segment and make it queryable.
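
The assignment step can be sketched as "pick the least-loaded servers, as many as the replication setting asks for". This is a toy strategy written for this post, not Pinot's real assignment logic; the `assign_segment` function and the server and segment names are hypothetical.

```python
def assign_segment(servers, current_load, replication):
    """Choose `replication` servers with the lowest load and record the assignment."""
    if replication > len(servers):
        raise ValueError("replication cannot exceed the number of servers")
    # Sort by load, then by name so that ties are broken predictably.
    chosen = sorted(servers, key=lambda s: (current_load[s], s))[:replication]
    for server in chosen:
        current_load[server] += 1
    return chosen


servers = ["server-1", "server-2", "server-3"]
load = {server: 0 for server in servers}

# Three daily segments, each stored on two servers.
assignment = {
    segment: assign_segment(servers, load, replication=2)
    for segment in ["sales_day1", "sales_day2", "sales_day3"]
}
print(assignment)
print(load)  # {'server-1': 2, 'server-2': 2, 'server-3': 2}
```

After the assignment, each chosen server would download its segments from the segment store and start serving queries from them.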

5.2 Real-Time Servers (Streaming Data Processing)

Real-time servers handle live data ingestion from real-time sources like Kafka or EventHubs.

How do they work?

· The server ingests data directly from a real-time stream.

· It keeps the ingested data in memory and periodically creates new segments based on thresholds (like time or size).

· These segments are then persisted to the segment store for long-term storage.

For example, if user activity logs are being streamed into Pinot, real-time servers ensure that live queries return the most recent data.
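
The "keep rows in memory, then cut a segment when a threshold is reached" behaviour can be modelled with a small class. This sketch uses only a row-count threshold, and the `ConsumingSegment` class with its `completed` list (standing in for the segment store) is an invention for this example.

```python
class ConsumingSegment:
    """Toy model of a real-time server buffering rows from a stream."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.rows = []        # in-memory rows, queryable right away
        self.completed = []   # stands in for segments saved to the segment store

    def ingest(self, row):
        """Add one row and cut a new segment once the threshold is reached."""
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        """Turn the in-memory rows into an immutable completed segment."""
        if self.rows:
            self.completed.append(tuple(self.rows))
            self.rows = []

    def query_count(self):
        """Count rows across completed segments and the in-memory buffer."""
        return len(self.rows) + sum(len(seg) for seg in self.completed)


segment = ConsumingSegment(max_rows=3)
for event_id in range(7):
    segment.ingest({"event_id": event_id})

print(len(segment.completed), len(segment.rows), segment.query_count())  # 2 1 7
```

Queries see all seven rows even though only six have been persisted, which is why live queries can return the most recent data.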

In summary, Pinot servers function as Helix participants, meaning they actively manage and store Pinot tables (called resources in Helix). Each segment of a table is treated as a partition, and a Pinot server can host multiple segments from different tables.

6. Minions

Minions are background workers in Pinot that handle heavy computational tasks, allowing other components (like servers and brokers) to focus on their primary responsibilities.

They use the Helix Task Framework, meaning they wait in standby mode and execute tasks when assigned by the Pinot controller.

How do Minions work?

· Attached to an existing Pinot cluster: Minions don’t store data or serve queries directly, but assist by running computational tasks.

· Controlled by the Pinot controller: The controller assigns tasks to minions.

· Pluggable custom tasks: Developers can add their own minion tasks via annotations.
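
The "wait for a task, run it, and let developers plug in new task types" idea can be imitated in Python with a decorator that fills a registry. This is only an analogy: Pinot's real plug-in mechanism is Java-based, and the `minion_task` decorator, the `TASK_REGISTRY` dictionary, and the task names below are all made up.

```python
from collections import deque

# Registry of task type -> function; a stand-in for annotation-based discovery.
TASK_REGISTRY = {}


def minion_task(task_type):
    """Decorator that registers a function as the executor for a task type."""
    def register(func):
        TASK_REGISTRY[task_type] = func
        return func
    return register


@minion_task("CountRowsTask")
def count_rows(payload):
    return len(payload)


@minion_task("DropNullsTask")
def drop_nulls(payload):
    return [row for row in payload if row is not None]


def run_assigned_tasks(queue):
    """Run tasks in the order the controller (here: a simple queue) assigned them."""
    results = []
    while queue:
        task_type, payload = queue.popleft()
        results.append(TASK_REGISTRY[task_type](payload))
    return results


assigned = deque([
    ("CountRowsTask", [1, 2, 3]),
    ("DropNullsTask", [1, None, 2]),
])
print(run_assigned_tasks(assigned))  # [3, [1, 2]]
```

Adding a new task type means writing one more decorated function; the loop that runs the tasks does not change.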

Common Tasks Performed by Minions

· Segment Creation: Helps in processing raw data and generating Pinot segments.

· Segment Purge: Removes outdated or unwanted data from existing segments (for example, to honor data-deletion requests) and helps optimize storage.

· Segment Merge: Combines multiple small segments into larger ones to improve query performance.
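
To give a feel for the merge task, here is a small Python sketch that greedily combines consecutive small segments until each merged segment reaches a target size. Segments are modelled as plain lists of rows, and the `merge_small_segments` function and its `target_rows` parameter are hypothetical.

```python
def merge_small_segments(segments, target_rows):
    """Combine consecutive segments until each merged one has at least target_rows."""
    merged = []
    current = []
    for segment in segments:
        current.extend(segment)
        if len(current) >= target_rows:
            merged.append(current)
            current = []
    if current:
        # Leftover rows that did not reach the target form a final smaller segment.
        merged.append(current)
    return merged


small_segments = [[1, 2], [3], [4, 5, 6], [7]]
print(merge_small_segments(small_segments, target_rows=3))
# [[1, 2, 3], [4, 5, 6], [7]]
```

Fewer, larger segments mean fewer pieces for each query to open and scan, which is where the performance gain comes from.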

7. Segment Store

A segment store is an external storage system where Pinot stores its data segments. It acts as a centralized repository for storing both offline and real-time segments so they can be downloaded and used by Pinot servers when needed.

References

https://docs.pinot.apache.org/release-1.0.0/basics/architecture


Wednesday, 2 July 2025
