Partitioning splits a large BigQuery table into smaller, manageable pieces called partitions based on a column (commonly of type DATE, TIMESTAMP, or INTEGER).
Instead of scanning the entire table, BigQuery only scans relevant partitions, resulting in:
· Faster query execution
· Lower costs
· Improved performance
1. Types of Partitioning
Type | Description
Date/Time Partitioning | Based on a DATE or TIMESTAMP column (e.g., order_date, event_time)
Ingestion-Time Partitioning | Automatically partitions by load time, exposed via the __PARTITIONTIME pseudo-column
Integer-Range Partitioning | Based on an integer column using a defined range
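For concreteness, here are minimal DDL sketches of each type (table and column names are hypothetical):
-- Date/time partitioning on a DATE column
CREATE TABLE mydataset.orders (order_id STRING, order_date DATE)
PARTITION BY order_date;
-- Ingestion-time partitioning: no partition column in the schema
CREATE TABLE mydataset.raw_events (payload STRING)
PARTITION BY _PARTITIONDATE;
-- Integer-range partitioning on an INT64 column
CREATE TABLE mydataset.regions (region_id INT64, name STRING)
PARTITION BY RANGE_BUCKET(region_id, GENERATE_ARRAY(1, 100, 10));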
2. Best Practices on Partitioning
2.1 Choose the Right Partition Column: Pick a low-cardinality column that’s commonly used in filters, like event_date or created_at. Avoid columns like user_id or UUID; use clustering for those instead (see the sketch after this list).
· Low-Cardinality column: A column with few distinct values (e.g., event_date, country, status).
· High-Cardinality: A column with many distinct values (e.g., user_id, transaction_id). If you partition by such a column, you end up with thousands of tiny partitions, each holding just a few KB/MB of data, which violates big data’s "large sequential scan" optimization principle.
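As a sketch of this guideline (assuming a hypothetical events table): partition by the low-cardinality date and cluster by the high-cardinality ID, rather than partitioning by the ID:
CREATE TABLE mydataset.events (user_id STRING, event_date DATE, action STRING)
PARTITION BY event_date -- low cardinality: one partition per day
CLUSTER BY user_id; -- high cardinality: handled by clustering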
2.2 Always Filter by the Partition Column
Use WHERE clauses on the partition column in your queries:
SELECT * FROM my_table WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31'
2.3 Avoid Functions on the Partition Column
When writing queries against a partitioned table in BigQuery, it’s important to write your WHERE clause in a way that lets BigQuery quickly find the relevant data. This is called partition pruning: BigQuery automatically skips partitions that don’t match your filter, which reduces cost and improves performance because only the necessary partitions are scanned. Pruning only works if you don’t apply any function or transformation to the partition column in the filter.
Effective Query Example
Take a look at the following query
SELECT * FROM my_table WHERE event_date = '2023-01-01'
This is good because:
· event_date is used as-is, without any transformation.
· BigQuery can easily identify and scan only the partition that matches '2023-01-01'.
· This makes your query faster and cheaper.
Not Effective Query Example
Take a look at the following query
SELECT * FROM my_table WHERE DATE(event_date) = '2023-01-01'
This is not performance- or cost-efficient because:
· You're using a function (DATE()) on the partition column.
· When you apply a function to the partition column, BigQuery can't tell which specific partition to scan.
· As a result, BigQuery may scan all partitions, which makes your query slower and more expensive.
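If the partition column is a TIMESTAMP (the usual reason DATE() gets wrapped around it), a pruning-friendly rewrite is to compare the raw column against a half-open range. A sketch, assuming event_date is TIMESTAMP-typed:
-- The partition column is used as-is, so BigQuery can still prune
SELECT * FROM my_table
WHERE event_date >= TIMESTAMP('2023-01-01')
AND event_date < TIMESTAMP('2023-01-02')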
2.4 Use Integer-Range Partitioning for Non-Date Data
If you don’t have a date column but have an integer field (e.g., region_id):
PARTITION BY RANGE_BUCKET(region_id, GENERATE_ARRAY(1, 100, 10))
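A minimal complete sketch around that clause (table and column names are hypothetical):
-- Buckets of width 10 over region_id; rows outside the range
-- land in the special __UNPARTITIONED__ partition
CREATE TABLE mydataset.sales_by_region (region_id INT64, amount NUMERIC)
PARTITION BY RANGE_BUCKET(region_id, GENERATE_ARRAY(1, 100, 10));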
2.5 Combine Partitioning with Clustering
When working with large datasets in BigQuery, partitioning helps to break the data into large chunks (like one partition per day), and clustering helps to organize data within those chunks. Using both together makes your queries faster and more cost-efficient.
For example,
CREATE TABLE sales.data PARTITION BY DATE(order_date) CLUSTER BY customer_id, product_id AS SELECT * FROM raw_data;
Here’s what this does:
· PARTITION BY DATE(order_date): Divides the data by date (1 partition per day).
· CLUSTER BY customer_id, product_id: Sorts the rows inside each date partition by customer_id and then product_id.
This setup makes queries like the one below more efficient:
SELECT * FROM sales.data WHERE order_date = '2024-01-01' AND customer_id = 'CUST123';
Here,
· BigQuery will scan only the partition for 2024-01-01.
· Within that partition, it can quickly jump to rows with customer_id = 'CUST123'.
2.6 Monitor Partition Size and Count
BigQuery allows you to partition your tables to improve query performance and reduce costs. But it's important to make sure your partitions are the right size and number, not too big and not too small.
Partitions should be:
· Large enough to scan efficiently, roughly 100 MB to 1 GB each
· Small enough to allow selective partition pruning (so BigQuery only reads the partitions it needs)
This range helps to maintain a balance:
· If your partitions are too small (like 5 MB each), BigQuery has to manage thousands of tiny pieces — this is inefficient and slows things down.
· If partitions are too big, BigQuery ends up scanning a lot of unnecessary data within that partition — defeating the purpose of partitioning.
How to Monitor Partition Size and Count?
You can use this query to see your table's partitions and how many rows each contains:
SELECT * FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS` WHERE table_name = 'your_table';
You can see information such as total_rows, total_logical_bytes, and total_billable_bytes.
This helps you check (see the aggregate query after this list):
· Are some partitions too small (under 100 MB)?
· Are there too many partitions?
· Should you consider switching to monthly partitions instead of daily?
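A sketch that aggregates the same view to answer these questions directly (the 100 MB threshold is the rule of thumb from above):
SELECT COUNT(*) AS partition_count,
ROUND(AVG(total_logical_bytes) / 1000000, 1) AS avg_partition_mb,
COUNTIF(total_logical_bytes < 100 * 1000000) AS partitions_under_100mb
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table'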
2.7 Use __PARTITIONTIME for Ingestion-Time Partitioned Tables
Ingestion-time partitioned tables have no partition column in their schema, so filter on the __PARTITIONTIME pseudo-column to get pruning:
SELECT * FROM my_table WHERE __PARTITIONTIME >= TIMESTAMP('2023-01-01')
2.8 Don’t Over-Partition
· Limit: 10,000 partitions per table (and a single job can modify at most 4,000 partitions). If daily partitions would approach this, consider monthly granularity (see the sketch below).
· Avoid creating partitions per user or high-cardinality fields.
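A sketch of the monthly option, assuming a hypothetical orders table:
-- One partition per month instead of per day
CREATE TABLE mydataset.orders_monthly (order_id STRING, order_date DATE)
PARTITION BY DATE_TRUNC(order_date, MONTH);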
2.9 Set Expiration for Temporary Data
Automatically delete old data with partition expiration:
CREATE TABLE temp_logs PARTITION BY DATE(event_time) OPTIONS (partition_expiration_days = 365) AS SELECT * FROM raw_logs;
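You can also set or change the expiration on an existing partitioned table with ALTER TABLE:
ALTER TABLE temp_logs SET OPTIONS (partition_expiration_days = 30);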