When working with large datasets in BigQuery, performance and cost become critical factors. BigQuery provides features like partitioning and clustering to optimize query performance and reduce costs. While partitioning divides data based on a column (like a date), clustering organizes the data within those partitions (or ) across the entire table if partitions are not defined for this table.
This guide focuses specifically on clustering without partitioning, using a simple example with Indian-origin names.
Clustering in BigQuery
Clustering is a way of organizing the physical layout of data in a BigQuery table based on the contents of one or more columns (up to 4 columns maximum).
When a table is clustered:
· BigQuery automatically sorts the data in storage based on the clustering columns.
· During queries, data pruning happens, only relevant blocks are read and make queries faster and cheaper.
How is Clustering Different from Partitioning?
Feature |
Partitioning |
Clustering |
Purpose |
Divides table into parts (partitions) |
Organizes rows inside the table/partition |
Based on |
Date, Timestamp, Integer column |
Up to 4 columns of any data type |
Query Filtering |
Reduces data scanned if partition filter used |
Reduces data scanned via sorting and pruning |
Limits |
1 partition column only and, maximum 4000 partitions for a table |
Max 4 clustering columns
|
When to use |
For large, time-based data |
For frequent filtering on a few columns |
Limitations of Clustering
· Maximum 4 clustering columns per table.
· Cannot recluster existing data manually (BigQuery reclusters data automatically over time).
· Clustering effectiveness reduces if the data is frequently updated or unordered on those fields.
· Not ideal if queries don’t use clustering columns in WHERE clause.
Creating a Clustered Table in BigQuery
CREATE TABLE `my_project.my_dataset.customers_clustered` ( customer_id INT64, first_name STRING, last_name STRING, city STRING, state STRING, email STRING, signup_date DATE ) CLUSTER BY city, state;
In this table, clustering is done on city and state, assuming queries often filter by these.
Insert Sample Records
INSERT INTO `my_project.my_dataset.customers_clustered` ( customer_id, first_name, last_name, city, state, email, signup_date ) VALUES (1, 'Aarav', 'Sharma', 'Mumbai', 'Maharashtra', 'aarav.sharma@example.com', '2024-01-10'), (2, 'Vivaan', 'Mehta', 'Pune', 'Maharashtra', 'vivaan.mehta@example.com', '2024-02-15'), (3, 'Diya', 'Patel', 'Ahmedabad', 'Gujarat', 'diya.patel@example.com', '2024-03-01'), (4, 'Ishaan', 'Reddy', 'Hyderabad', 'Telangana', 'ishaan.reddy@example.com', '2024-04-05'), (5, 'Anaya', 'Nair', 'Kochi', 'Kerala', 'anaya.nair@example.com', '2024-05-20'), (6, 'Kabir', 'Singh', 'Chandigarh', 'Chandigarh', 'kabir.singh@example.com', '2024-06-18'), (7, 'Meera', 'Das', 'Bhubaneswar', 'Odisha', 'meera.das@example.com', '2024-07-09'), (8, 'Reyansh', 'Mishra', 'Lucknow', 'Uttar Pradesh', 'reyansh.mishra@example.com', '2024-08-22'), (9, 'Saanvi', 'Tripathi', 'Varanasi', 'Uttar Pradesh', 'saanvi.tripathi@example.com', '2024-09-30'), (10, 'Advait', 'Chatterjee', 'Kolkata', 'West Bengal', 'advait.chatterjee@example.com', '2024-10-10'), (11, 'Myra', 'Roy', 'Kolkata', 'West Bengal', 'myra.roy@example.com', '2024-11-11'), (12, 'Arjun', 'Rao', 'Bangalore', 'Karnataka', 'arjun.rao@example.com', '2024-12-01');
How Clustering Helps in Query Performance?
Example Query:
SELECT first_name, last_name, email FROM `my_project.my_dataset.customers_clustered` WHERE city = 'Kolkata' AND state = 'West Bengal';
Since clustering is done on city and state, BigQuery scans fewer blocks, making the query cheaper and faster.
How can I get clustering information for a specific table?
Run the following query to get clustering information for a specific table.
SELECT * FROM `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS` WHERE clustering_ordinal_position IS NOT NULL AND table_name = 'your_table_name';
For customers_clustered table, I received following information.
In Summary,
· Use clustering when your queries frequently filter or group by specific columns.
· Clustering works best for read-heavy tables with append-only workloads.
· Unlike partitioning, clustering doesn't restrict you to date-based columns.
· Combine with partitioning for maximum optimization on large datasets.
Previous Next Home
No comments:
Post a Comment