Friday, 6 June 2025

BigQuery Partitioned Tables: Boost Query Performance & Cut Costs

Are your BigQuery queries slow and expensive? Learn how partitioned tables can supercharge performance, reduce scanning costs, and enable parallel processing.

 

This beginner-friendly guide covers:

 

·      What are partitioned tables?

·      How do they improve query performance?

·      Different ways to partition the table

 

1. What is a Partitioned Table?

A partitioned table in BigQuery is a table that is divided into smaller, more manageable segments (partitions) based on a column value. This helps to optimize performance and reduce query costs by scanning only the relevant partitions instead of the entire table.

 

Let’s say we have a table sales_data that stores sales transactions.

CREATE TABLE my_dataset.sales_data (
    order_id STRING,
    customer_id STRING,
    order_date TIMESTAMP,
    amount FLOAT64
);

This table is not partitioned, meaning every query scans the entire dataset, even if we filter by order_date.

 

Partitioned Table (Optimized):

CREATE TABLE my_dataset.sales_data_partitioned
PARTITION BY DATE(order_date)
AS SELECT * FROM my_dataset.sales_data;

Here, the table is partitioned by order_date, meaning data is stored in separate partitions based on different dates. Queries filtering by order_date will scan only the required partitions instead of the full dataset.

 

The sales_data_partitioned table inherit the schema from sales_data, because I am creating it using AS SELECT *.

 

2. Advantages Of Partition Table

 

2.1 Query Performance Improvement: Since only the relevant partitions are scanned, query performance is faster compared to a non-partitioned table.

 

For Example, if a user queries data for 2024-04-01, only that partition is scanned instead of the entire dataset.

 

SELECT * 
FROM my_dataset.sales_data_partitioned
WHERE order_date = '2024-04-01';

Only the 2024-04-01 partition is scanned.

 

Query on Non-Partitioned Table (Slow & Expensive)

SELECT * 
FROM my_dataset.sales_data
WHERE order_date = '2024-04-01';

Scans the entire table, increasing cost and slowing performance.

 

2.2 Reduced Scanning (Lower Costs & Faster Results)

BigQuery charges based on the amount of data scanned. Partitioning reduces unnecessary scanning, leading to lower costs.

 

Example:

·      A non-partitioned table with 100GB will always scan 100GB.

·      A partitioned table where each partition is 10GB will scan only 10GB if the query is filtered correctly.

 

2.3. Increased Parallelism

·      Partitioning allows BigQuery to process queries in parallel, making execution faster.

·      Each partition is queried separately, leading to better resource utilization.

·      This is especially useful for big datasets with frequent queries.

 

3. How to Partition a Table?

Partitioning can be applied in three ways:

 

·      Ingestion-Time Partitioning

·      Partitioning by Date/Timestamp Column

·      Partitioning by Integer Column

 

3.1 Ingestion-Time Partitioning

BigQuery automatically partitions data based on the time it was ingested (loaded). No need to manually specify a partitioning column.

CREATE TABLE my_dataset.ingestion_partitioned_table
PARTITION BY _PARTITIONTIME
AS SELECT * FROM my_dataset.sales_data;

_PARTITIONTIME is an inbuilt identifier to specify the ingestion time.

 

3.2 Partitioning by Date/Timestamp Column

Best for time-series data (e.g., sales, logs, transactions). You must specify a DATE or TIMESTAMP column for partitioning.

CREATE TABLE my_dataset.sales_data_partitioned
PARTITION BY DATE(order_date)
AS SELECT * FROM my_dataset.sales_data;

Use this, when your queries frequently filter by date.

 

3.3 Partitioning by Integer Column

Used when data is naturally divided by numeric ranges (e.g., Customer ID, Region ID).

 

Use this, when data is frequently queried by numeric values, like user IDs, product categories, or region codes.

 

In summary, Partitioned tables are a great way to improve query performance, reduce costs, and enable parallel processing in BigQuery.

 

·      Always choose a partitioning strategy based on your query patterns.

·      Use date-based partitioning for time-series data.

·      Consider integer partitioning for datasets divided by numeric categories.

 

Previous                                                    Next                                                    Home

No comments:

Post a Comment