Programming for beginners: Inserting Data into an Apache Pinot Offline Table: A Beginner's Guide

Apache Pinot is a real-time distributed OLAP (Online Analytical Processing) datastore designed for ultra-fast query performance on large-scale data. It is widely used for analytical applications, including user-facing analytics, anomaly detection, and real-time dashboards.

In this guide, I will walk through inserting data into an offline table in Apache Pinot. We will cover key concepts like schema, table definition, offline tables, and batch ingestion, along with a practical example.

Understanding Key Concepts

Before diving into the data ingestion process, let's clarify some key terms:

1. Schema

A schema defines the structure of data stored in Pinot, specifying column names, data types, and field categories.

Pinot supports three types of fields:

· Dimension Fields: Attributes used for filtering and grouping (e.g., order_id, customer_id, status)

· Metric Fields: Numeric values used for aggregation (e.g., total_amount)

· DateTime Fields: Time-based fields for temporal queries (e.g., order_date)

schema.json

{
  "schemaName": "orders", 
  "enableColumnBasedNullHandling": false,
  "dimensionFieldSpecs": [
    {
      "name": "order_id",
      "dataType": "STRING",
      "fieldType": "DIMENSION"
    },
    {
      "name": "customer_id",
      "dataType": "STRING",
      "fieldType": "DIMENSION"
    },
    {
      "name": "status",
      "dataType": "STRING",
      "fieldType": "DIMENSION"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "total_amount",
      "dataType": "DOUBLE",
      "fieldType": "METRIC"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "order_date",
      "dataType": "TIMESTAMP",
      "fieldType": "DATE_TIME",
      "format": "TIMESTAMP",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

Let’s understand what is in the schema.

a. schemaName: "orders"

This names the schema "orders", meaning it's defining the structure for order data.

b. enableColumnBasedNullHandling: false

When "enableColumnBasedNullHandling" is set to false, Pinot does not explicitly track null values. Instead, it follows the default transformation behavior, where:

· Numeric columns (e.g., INT, DOUBLE): Default to 0

· String columns: Default to "null" (as a string, not an actual null)

· Boolean columns: Default to false

· Timestamp columns: Default to 0 (epoch time)

This behavior is necessary to ensure that Pinot can efficiently build inverted indexes, forward indexes, and dictionary encodings without needing extra null markers.

When enableColumnBasedNullHandling is set to true, then Pinot

Allows columns to explicitly define their null handling and supports two approaches:

· Default null replacement values (same as above but configurable per-column)

{
  "name": "total_amount",
  "dataType": "DOUBLE",
  "defaultNullValue": -1.0
}

When you want to distinguish between "real zeros" and "nulls" (e.g., 0.0 = actual zero, -1.0 = null).

· Native null support (stores actual nulls): Pinot tracks null values explicitly instead of replacing them. Queries can use IS NULL and IS NOT NULL conditions to filter records. This requires additional storage and memory overhead to maintain null bitmaps.

c. Field Types: The schema defines three types of fields:

c.1. Dimension Fields (dimensionFieldSpecs)

These are categorical fields used for filtering and grouping:

· order_id: A string that uniquely identifies each order

· customer_id: A string identifying the customer who placed the order

· status: A string representing the order status (like "shipped", "pending", etc.)

c.2 Metric Fields (metricFieldSpecs)

These are numeric fields used for calculations:

· total_amount: A double-precision number storing the order's total cost

c.3. DateTime Fields (dateTimeFieldSpecs)

These are time-based fields:

· order_date: A timestamp field storing when the order was placed

o Format: TIMESTAMP (exact date and time)

o Granularity: 1 millisecond (very precise timing)

This schema tells Pinot:

· What fields to expect in the "orders" data

· What type each field is (string, number, timestamp)

· How to handle each field during queries and analytics

For example, Pinot will know:

· You can sum/average total_amount (it's a metric)

· You can filter/group by customer_id or status (they're dimensions)

· You can do time-based analysis on order_date

1.1 How to insert Schema to Pinot?

Open the pinot WebApplication by navigating to the url ‘http://localhost:9000’.

Click on ‘Swagger REST API’ to open the swagger endpoint, and execute the API ‘POST /schema’ endpoint by passing the schema payload.

Upon successful execution of the API, you can see successful response like below.

1.2 How to verify the schema in Pinot UI?

Navigate to the pinot UI ‘http://localhost:9000’.

Click on TABLES, you will be navigated to the tables and schemas page.

Under the SCHEMAS section, you can see orders schema with some high-level summary like

a. Total Dimension columns: 3

b. Total Data-Time columns: 1

c. Total Metrics columns: 1

Click on orders schema, you will be takes to schema json.

2. Table Definition

A table in Apache Pinot is where data is stored and indexed. It can be of three types:

· REALTIME: For streaming ingestion (e.g., Kafka, Pulsar)

· OFFLINE: For batch ingestion (e.g., files like JSON, CSV, Parquet)

· HYBRID: A combination of both REALTIME and OFFLINE tables

In our case, we are working with an offline table, where data is ingested in batches.

table.json

{
  "tableName": "orders",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "order_date",
    "timeType": "MILLISECONDS",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["order_id", "customer_id", "status"],
    "rangeIndexColumns": ["total_amount"],
    "noDictionaryColumns": [],
    "starTreeIndexConfigs": [],
    "enableDefaultStarTree": false
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {
    "customConfigs": {}
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  },
  "schemaName": "orders"
}

tableName: The name of your table (orders).

tableType: OFFLINE means this table stores batch data (as opposed to REALTIME for streaming data).

schemaName: References the schema (orders) that defines the columns and data types.

Segments Configuration

{
 "segmentsConfig": {
   "timeColumnName": "order_date",
   "timeType": "MILLISECONDS",
   "replication": "1"
 }
}

· timeColumnName: The column (order_date) used to partition data by time (useful for time-based queries and retention).

· timeType: The precision of the time column (MILLISECONDS).

· replication: How many copies of each data segment to store (1 = no replication). Higher values improve fault tolerance but use more storage.

Indexing Configuration

{
 "tableIndexConfig": {
  "loadMode": "MMAP",
  "invertedIndexColumns": ["order_id", "customer_id", "status"],
  "rangeIndexColumns": ["total_amount"],
  "noDictionaryColumns": [],
  "enableDefaultStarTree": false
 }
}

· loadMode: How data is loaded into memory (MMAP = memory-mapped files, balances speed and memory usage).

· invertedIndexColumns: Columns with an inverted index for fast filtering (e.g., WHERE status = 'shipped').

· rangeIndexColumns: Columns optimized for range queries (e.g., WHERE total_amount BETWEEN 100 AND 200).

· noDictionaryColumns: Columns that skip dictionary encoding (rarely used; empty here).

· enableDefaultStarTree: Whether to use a star-tree index for pre-aggregated queries (false = disabled).

Tenants Configuration

{
 "tenants": {
   "broker": "DefaultTenant",
   "server": "DefaultTenant"
 }
}

Pinot uses multi-tenancy to isolate resources. Here, both the query engine (broker) and storage layer (server) use the default tenant.

Metadata & Custom Configs

{
 "metadata": {
   "customConfigs": {}
 }
}

Placeholder for custom key-value settings (empty in this case).

Ingestion Configuration

{
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  }
}

· segmentIngestionType: APPEND means new data is added without modifying existing segments.

· segmentIngestionFrequency: How often batch data is ingested (DAILY = once per day).

In summary,

· This is an offline table: It’s optimized for batch data (not real-time streams).

· Time-based partitioning: Data is organized by order_date for efficient time-range queries.

· Indexes are critical:

o Inverted indexes speed up filtering on order_id, customer_id, and status.

o Range indexes optimize queries on total_amount.

o Simple deployment: Uses default tenants and no custom metadata.

How to onboard the table definition to Pinot?

Navigate to Pinot Swagger, and execute the API ‘POST /tables’ by passing the table definition.

Upon successful insertion of table definition, you can see following message.

Table orders_OFFLINE successfully added"

Pinot automatically appended _OFFLINE to your table name (orders) to indicate this is a batch/offline table (not real-time).

This is a naming convention Pinot uses internally:

· orders_OFFLINE → Batch data (loaded periodically, e.g., from files).

· orders_REALTIME → Streaming data (e.g., from Kafka).

You can see orders_OFFLINE table from Pinot UI as well.

3. What is the difference between schema and table in Pinot

Pinot separates schema and table because they serve distinct purposes in its distributed, columnar design.

3.1 Schema (What the data looks like)

Defines the structure of the data:

· Column names, data types (STRING, INT, etc.).

· Field classifications: DIMENSION (for filtering/grouping), METRIC (for aggregations), DATE_TIME (for time-based queries).

The schema is a logical contract applied to incoming data. For example, orders schema defines that total_amount is a DOUBLE METRIC field.

3.2 Table (How the data is stored and indexed)

Defines physical storage properties:

· Where data comes from (e.g., Kafka, batch files).

· How data is partitioned/sharded.

· Indexing rules (inverted indexes, bloom filters, etc.).

· Retention policies (e.g., keep data for 30 days).

Table references a schema to understand the data structure.

With this separation, the same schema can be reused across multiple tables (e.g., orders_realtime and orders_offline).

4. Insert data to orders_OFFLINE table

Sample data to ingest.

data.json

[
  {
    "order_id": "O1001",
    "customer_id": "C001",
    "status": "SHIPPED",
    "total_amount": 250.75,
    "order_date": 1711924800000
  },
  {
    "order_id": "O1002",
    "customer_id": "C002",
    "status": "PENDING",
    "total_amount": 100.50,
    "order_date": 1712001200000
  },
  {
    "order_id": "O1003",
    "customer_id": "C003",
    "status": "DELIVERED",
    "total_amount": 325.00,
    "order_date": 1712087600000
  },
  {
    "order_id": "ORD-1001",
    "customer_id": "CUST-001",
    "status": "COMPLETED",
    "total_amount": 125.99,
    "order_date": 1672531200000
  },
  {
    "order_id": "ORD-1002",
    "customer_id": "CUST-002",
    "status": "PENDING",
    "total_amount": 89.50,
    "order_date": 1672617600000
  }
]

To upload this JSON file into Pinot, we use the ingestFromFile API.

curl -X POST -F file=@data.json -H "Content-Type: multipart/form-data" "http://localhost:9000/ingestFromFile?tableNameWithType=orders_OFFLINE&batchConfigMapStr={"inputFormat":"json"}"

Since query parameters need encoding, the final command becomes:

curl -X POST -F file=@data.json \
  -H "Content-Type: multipart/form-data" \
"http://localhost:9000/ingestFromFile?tableNameWithType=orders_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22json%22%7D"

$ curl -X POST -F file=@data.json \
>   -H "Content-Type: multipart/form-data" \
>   "http://localhost:9000/ingestFromFile?tableNameWithType=orders_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22json%22%7D"
{"status":"Successfully ingested file into table: orders_OFFLINE as segment: orders_1672531200000_1712087600000_1743269144055"}

5. Query orders table

Go to Pinot UI ‘http://localhost:9000’, click on "Query Console" link available at left navigation bar.

Click on orders table link, you can see a select query is populated and executed by default, and show the results.

In summary, we covered:

· Understanding Pinot schema and table definitions

· Creating an offline table for batch ingestion

· Using cURL to ingest JSON data into Pinot

· Querying the inserted data

Apache Pinot provides fast analytics on large datasets, making it a great choice for interactive dashboards and monitoring applications. If you are working with batch data, offline tables provide a powerful way to manage historical data efficiently.

Previous Next Home

Programming for beginners

Tuesday, 15 July 2025

Inserting Data into an Apache Pinot Offline Table: A Beginner's Guide

No comments:

Post a Comment