Apache Pinot is a real-time distributed OLAP datastore designed for low-latency analytics. To work with batch data, we ingest it into a Pinot offline table. This guide walks you through creating a schema, a table definition, a sample data file, and a job specification file. We will also demonstrate how to submit the ingestion job using pinot-admin.sh.
Step 1: Onboard Schema
Create a JSON file named batch_schema.json that defines the schema for our table.
batch_schema.json
{
  "schemaName": "customer_purchases",
  "dimensionFieldSpecs": [
    {"name": "customerId", "dataType": "STRING"},
    {"name": "customerName", "dataType": "STRING"},
    {"name": "purchaseAmount", "dataType": "DOUBLE"}
  ],
  "metricFieldSpecs": [],
  "dateTimeFieldSpecs": [
    {
      "name": "purchaseDate",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
Navigate to the Pinot Swagger UI and execute the “POST /schemas” API with the above payload.
Upon successful execution of the API, you will receive the following response.
{
  "unrecognizedProperties": {},
  "status": "customer_purchases successfully added"
}
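Alternatively, you can post the schema from the command line with curl (assuming the controller runs at http://localhost:9000):

curl -X POST -H "Content-Type: application/json" \
  -d @batch_schema.json \
  http://localhost:9000/schemas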
Step 2: Onboard Table Definition
table.json
{
  "tableName": "customer_purchases",
  "schemaName": "customer_purchases",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "purchaseDate",
    "timeType": "MILLISECONDS",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["customerId", "purchaseAmount"],
    "rangeIndexColumns": ["purchaseAmount"],
    "noDictionaryColumns": [],
    "starTreeIndexConfigs": [],
    "enableDefaultStarTree": false
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {
    "customConfigs": {}
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  }
}
Execute the “POST /tables” API to onboard the customer_purchases table definition.
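As with the schema, this can also be done from the command line (again assuming the controller at http://localhost:9000):

curl -X POST -H "Content-Type: application/json" \
  -d @table.json \
  http://localhost:9000/tables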
Step 3: Create Sample Data
Create a CSV file named customerPurchases.csv with sample data.
customerPurchases.csv
customerId,customerName,purchaseAmount,purchaseDate
C001,John Doe,100.5,1711929600000
C002,Jane Smith,250.0,1712016000000
C003,Robert Brown,175.75,1712102400000
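The purchaseDate values are epoch milliseconds, matching the schema’s 1:MILLISECONDS:EPOCH format. To sanity-check a value, drop the last three digits (milliseconds to seconds) and convert it with GNU date:

date -u -d @1711929600
# Mon Apr  1 00:00:00 UTC 2024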
Step 4: Create a Job Specification File
Create a YAML file named batch_ingestion_job.yml that defines the batch ingestion job.
batch_ingestion_job.yml
executionFrameworkSpec:
  name: "standalone"
  segmentGenerationJobRunnerClassName: "org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner"
  segmentTarPushJobRunnerClassName: "org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner"
  segmentUriPushJobRunnerClassName: "org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner"
pinotFSSpecs:
  - scheme: "file"
    className: "org.apache.pinot.spi.filesystem.LocalPinotFS"
jobType: "SegmentCreationAndTarPush"
inputDirURI: "file:///{INPUT_DIRECTORY_PATH}"
includeFileNamePattern: "glob:**/*.csv"
outputDirURI: "file:///{OUTPUT_DIRECTORY_PATH}"
overwriteOutput: true
tableSpec:
  tableName: "customer_purchases"
pinotClusterSpecs:
  - controllerURI: "http://localhost:9000"
recordReaderSpec:
  dataFormat: "csv"
  className: "org.apache.pinot.plugin.inputformat.csv.CSVRecordReader"
  configClassName: "org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig"
  configs:
    header: "customerId,customerName,purchaseAmount,purchaseDate"
    delimiter: ","
    skipHeader: "true"
pushJobSpec:
  pushAttempts: 3
  pushRetryIntervalMillis: 1000
  pushParallelism: 1
Replace {INPUT_DIRECTORY_PATH}", {OUTPUT_DIRECTORY_PATH}" with actual folder paths.
Step 5: Submit the Job
Run the following command to submit the batch ingestion job:
pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/batch_ingestion_job.yml
Upon successful execution of the above command, the job reads the CSV file, builds a segment in the output directory, and pushes it to the customer_purchases offline table.
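To verify that the data landed, query the table; for example, with curl against the broker (assuming it listens on the default port 8099):

curl -X POST -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM customer_purchases LIMIT 10"}' \
  http://localhost:8099/query/sql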
You might wonder: why can’t I just use the Pinot REST API to save my data directly? Why create a job at all?
Using the REST API is like carrying items one by one instead of delivering them in bulk.
What Happens When You Use a REST API?
· You send raw data to Pinot using an API request.
· Pinot stores raw records in a "consuming" segment (in-memory buffer).
· When the segment reaches a threshold (time/size), Pinot converts it into an immutable segment.
· Small segments may pile up, leading to "segment fragmentation."
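As a concrete example of this per-request style, the Pinot controller exposes an ad-hoc /ingestFromFile endpoint that builds one small segment per call. A sketch, assuming the controller at http://localhost:9000 and a minimal batch config (in practice the batchConfigMapStr value should be URL-encoded):

curl -X POST -F file=@customerPurchases.csv \
  "http://localhost:9000/ingestFromFile?tableNameWithType=customer_purchases_OFFLINE&batchConfigMapStr={\"inputFormat\":\"csv\"}"

Calling an endpoint like this once per small file is exactly how small segments pile up.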
What Happens When You Use a Batch Job?
· Reads bulk data (CSV/JSON/Parquet) from a file system (S3, HDFS, local).
· Pre-processes data into optimized segments (sorted, compressed, indexed).
· Uploads segments directly to Pinot (bypassing real-time ingestion overhead).
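After the batch job from Step 5 completes, you can see what the batch path produced by listing the table’s segments through the controller (assuming http://localhost:9000):

curl http://localhost:9000/segments/customer_purchases

With the standalone runner, each input CSV file yields a single well-formed segment rather than many small ones.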