Partitioning in BigQuery is a powerful feature that improves query performance and cost efficiency. However, it comes with strict limitations that can impact long-term data retention and ingestion strategies.
This post explores the key partition limits in BigQuery, what happens when you reach them, and the best strategies to manage and optimize partitioned tables.
1. Key Partition Limitations in BigQuery
1.1 Maximum Number of Partitions per Table: 4,000
A single partitioned table can contain a maximum of 4,000 partitions.
If you partition by DATE (YYYY-MM-DD), you’ll hit this limit after approximately 11 years (4,000 days ≈ 10.96 years).
1.2 Maximum Partitions Modified per Job: 4,000
A single job (query, load, or DML operation) cannot modify more than 4,000 partitions at once. This includes INSERT, UPDATE, DELETE, and MERGE operations.
1.3 Maximum Partitions Modified Per Day (Per Project):
· Ingestion-time partitioned tables: 5,000 partitions/day
· Column-partitioned tables: 30,000 partitions/day
· If a project exceeds these limits, queries and data ingestion may fail.
2. What Happens When You Hit the Partition Limit?
Once a table reaches 4,000 partitions, BigQuery stops creating new partitions.
Here’s what happens:
· Existing partitions continue to function, queries on old data work fine.
· New data for future dates is rejected and any attempt to insert into a new partition fails.
· DML operations for new dates fail, you cannot insert or update data into a new partition.
3. Solutions to Avoid Hitting the Partition Limit
Option 1: Implement Partition Rotation (Delete Old Data)
If historical data isn’t frequently used, set up a data retention policy:
DELETE FROM `my_project.my_dataset.my_table` WHERE date_column < DATE_SUB(CURRENT_DATE(), INTERVAL 10 YEAR);
Above snippet ensures that you always have the latest 10 years of data. Frees up old partitions for new data.
Option 2: Use Monthly/Yearly Partitioning Instead of Daily
Partitioning by month or year significantly reduces the number of partitions:
CREATE TABLE `my_project.my_dataset.my_table` PARTITION BY DATE_TRUNC(date_column, MONTH) AS SELECT * FROM existing_table;
Above snippet partitions the table by month using the date_column. DATE_TRUNC(date_column, MONTH) converts date_column to the first day of the month. All rows within the same month will be stored in the same partition.
For example, if date_column has values:
2024-04-15
2024-04-20
2024-05-10
Then:
· Rows with 2024-04-15 and 2024-04-20 will be in the April 2024 partition (2024-04-01).
· Row with 2024-05-10 will be in the May 2024 partition (2024-05-01).
Option 3: Two-Level Partitioning (Date + Clustering)
Combine partitioning by year with clustering by date to improve efficiency:
CREATE TABLE `my_project.my_dataset.my_table` PARTITION BY EXTRACT(YEAR FROM date_column) CLUSTER BY date_column AS SELECT * FROM existing_table;
Above snippet partitions the table by year.
If date_column contains:
2022-06-15
2022-12-20
2023-01-10
· Rows with 2022-06-15 and 2022-12-20 will be in the 2022 partition.
· Row with 2023-01-10 will be in the 2023 partition.
CLUSTER BY date_column
lustering organizes data within each partition by sorting it based on date_column. It helps improve query performance by reducing the amount of data scanned.
Suppose your 2022 partition has:
2022-01-05
2022-06-15
2022-09-20
2022-12-31
Since it's clustered by date_column, when you run:
SELECT * FROM my_table WHERE date_column = '2022-06-15'
BigQuery skips unnecessary rows, and scan only relevant data.
Since clustering does not have a strict limit like partitions, you can store millions of distinct dates inside a single yearly partition. Queries that filter by a specific date still perform efficiently because:
· Data is physically stored in sorted blocks and fewer scanned rows is enough to address the query.
· BigQuery automatically optimizes storage layout.
· Queries that filter on date_column are still efficient, even though it's not a partition key.
If you find that queries scanning large yearly partitions are slow, consider partitioning by month (YYYY-MM) instead of a full year.
CREATE TABLE `my_project.my_dataset.my_table` PARTITION BY DATE_TRUNC(date_column, MONTH) CLUSTER BY date_column AS SELECT * FROM existing_table;
In summary, Partitioning is a powerful tool in BigQuery, but exceeding its limits can cause data ingestion failures and query errors. By using monthly partitioning, partition rotation, clustering, and data archiving, you can ensure your BigQuery tables remain scalable and efficient for long-term use.
Previous Next Home
No comments:
Post a Comment