Tuesday, 10 June 2025

Best Practices on Partitioning in BigQuery

Partitioning splits a large BigQuery table into smaller, manageable pieces called partitions based on a column (commonly of type DATE, TIMESTAMP, or INTEGER).

Instead of scanning the entire table, BigQuery only scans relevant partitions, resulting in:

·      Faster query execution

·      Lower costs

·      Improved performance

 

1. Types of Partitioning

·      Date/Time Partitioning: Based on a DATE or TIMESTAMP column (e.g., order_date, event_time)

·      Ingestion-Time Partitioning: Automatically partitions by load time using __PARTITIONTIME

·      Integer-Range Partitioning: Based on an integer column using a defined range

 

2. Best Practices on Partitioning

 

2.1 Choose the Right Partition Column: Pick a low-cardinality column that’s commonly used in filters, like event_date or created_at. Avoid columns like user_id or UUID; use clustering for those instead.

 

·      Low-Cardinality column:  A column with few distinct values (e.g., event_date, country, status).

·      High-Cardinality: A column with many distinct values (e.g., user_id, transaction_id). If you partition by such a column, you’ll end up with thousands of tiny partitions, each containing just a few KB/MB of data, which defeats big data’s "large sequential scan" optimization principle.

 

2.2 Always Filter by the Partition Column

Use WHERE clauses on the partition column in your queries:

SELECT * FROM my_table
WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31'

2.3 Avoid Functions on the Partition Column

When writing queries against a partitioned table in BigQuery, structure your WHERE clause so BigQuery can quickly locate the relevant data. This is called partition pruning: BigQuery automatically skips partitions that don’t match your filter, scanning only the necessary ones, which reduces cost and improves performance. Pruning only works if you don’t apply a function or transformation to the partition column in the filter.

 

Effective Query Example

Take a look at the following query:

SELECT * FROM my_table
WHERE event_date = '2023-01-01'

This is good because:

·      event_date is used as-is, without any transformation.

·      BigQuery can easily identify and scan only the partition that matches '2023-01-01'.

·      This makes your query faster and cheaper.

 

Not Effective Query Example

Take a look at the following query:

SELECT * FROM my_table
WHERE DATE(event_date) = '2023-01-01'

 

This is not performance- or cost-efficient because:

·      You're using a function (DATE()) on the partition column.

·      When you apply a function to the partition column, BigQuery can't tell which specific partition to scan.

·      As a result, BigQuery may scan all partitions, which makes your query slower and more expensive.
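If event_date is actually a TIMESTAMP column, you can keep pruning by comparing the column as-is against a timestamp range instead of wrapping it in DATE(). A sketch, assuming the same my_table:

SELECT * FROM my_table
WHERE event_date >= TIMESTAMP('2023-01-01')
AND event_date < TIMESTAMP('2023-01-02')

This half-open range covers the same calendar day while leaving the partition column untransformed, so BigQuery can still prune.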

 

2.4 Use Integer-Range Partitioning for Non-Date Data

If you don’t have a date column but have an integer field (e.g., region_id):

PARTITION BY RANGE_BUCKET(region_id, GENERATE_ARRAY(1, 100, 10))
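In context, a full table definition using this clause might look like the following (table and column names are hypothetical):

CREATE TABLE mydataset.regional_sales
PARTITION BY RANGE_BUCKET(region_id, GENERATE_ARRAY(1, 100, 10))
AS SELECT * FROM mydataset.raw_sales;

Here GENERATE_ARRAY(1, 100, 10) produces the boundaries 1, 11, 21, and so on up to 91, so each partition covers a span of ten region_id values; rows falling outside the range go into a special unpartitioned bucket.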

2.5 Combine Partitioning with Clustering

When working with large datasets in BigQuery, partitioning helps to break the data into large chunks (like one partition per day), and clustering helps to organize data within those chunks. Using both together makes your queries faster and more cost-efficient.

 

For example,

CREATE TABLE sales.data
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, product_id
AS SELECT * FROM raw_data;

Here’s what this does:

 

·      PARTITION BY DATE(order_date): Divides the data by date (1 partition per day).

·      CLUSTER BY customer_id, product_id: Sorts the rows inside each date partition by customer_id and then product_id.

 

This setup makes queries like the one below more efficient:

SELECT * 
FROM sales.data
WHERE order_date = '2024-01-01'
AND customer_id = 'CUST123';

Here,

·      BigQuery will scan only the partition for 2024-01-01.

·      Within that partition, it can quickly jump to rows with customer_id = 'CUST123'.

 

2.6 Monitor Partition Size and Count

BigQuery allows you to partition your tables to improve query performance and reduce costs. But it's important to make sure your partitions are the right size and number, not too big and not too small.

 

Partitions should be:

·      Large enough to scan efficiently, roughly 100 MB to 1 GB each

·      Small enough to allow selective partition pruning (so BigQuery only reads the partitions it needs)

 

This range helps to maintain a balance:

·      If your partitions are too small (like 5 MB each), BigQuery has to manage thousands of tiny pieces — this is inefficient and slows things down.

·      If partitions are too big, BigQuery ends up scanning a lot of unnecessary data within that partition — defeating the purpose of partitioning.

 

How to Monitor Partition Size and Count?

You can use this query to see your table's partitions and how many rows each contains:

SELECT * 
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table';

The result includes columns such as partition_id, total_rows, total_logical_bytes, and total_billable_bytes.

 


This helps you check:

 

·      Are some partitions too small (under 100 MB)?

·      Are there too many partitions?

·      Should you consider switching to monthly partitions instead of daily?
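For example, here is a sketch that flags undersized partitions using the same INFORMATION_SCHEMA view (the 100 MB threshold and table names are illustrative):

SELECT partition_id, total_rows,
ROUND(total_logical_bytes / POW(2, 20), 1) AS logical_mb
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table'
AND total_logical_bytes < 100 * POW(2, 20)
ORDER BY logical_mb;

If most partitions show up in this result, that is a signal to switch to a coarser partition granularity.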

 

2.7 Use __PARTITIONTIME for Ingestion-Time Partitioned Tables

Ingestion-time partitioned tables don’t have a visible partition column; filter on the __PARTITIONTIME pseudo-column instead:

SELECT * FROM my_table
WHERE __PARTITIONTIME >= TIMESTAMP('2023-01-01')

 

2.8 Don’t Over-Partition

·      Partition limit: BigQuery allows up to 10,000 partitions per table

·      Avoid creating partitions per user or high-cardinality fields.
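If daily partitions would push a long-lived table toward the limit, coarser monthly partitions are one option. A sketch, with hypothetical table names:

CREATE TABLE mydataset.sales_monthly
PARTITION BY DATE_TRUNC(order_date, MONTH)
AS SELECT * FROM mydataset.raw_sales;

DATE_TRUNC(order_date, MONTH) rolls each date up to the first of its month, so a year of data produces 12 partitions instead of 365.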

 

2.9 Set Expiration for Temporary Data

Automatically delete old data with partition expiration:

CREATE TABLE temp_logs
PARTITION BY DATE(event_time)
OPTIONS (partition_expiration_days = 365)
AS SELECT * FROM raw_logs;
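For a table that already exists, the same retention can be applied afterwards (sketch; assumes temp_logs was created as above):

ALTER TABLE temp_logs
SET OPTIONS (partition_expiration_days = 365);

Partitions older than the expiration window are then deleted automatically, with no scheduled cleanup job needed.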

 

 
