Tuesday, 10 June 2025

Understanding Clustering in BigQuery

When working with large datasets in BigQuery, performance and cost become critical factors. BigQuery provides features like partitioning and clustering to optimize query performance and reduce costs. While partitioning divides data based on a column (like a date), clustering organizes the data within those partitions (or ) across the entire table if partitions are not defined for this table.

This guide focuses specifically on clustering without partitioning, using a simple example with Indian-origin names.

 

Clustering in BigQuery

Clustering is a way of organizing the physical layout of data in a BigQuery table based on the contents of one or more columns (up to 4 columns maximum).

 

When a table is clustered:

·      BigQuery automatically sorts the data in storage based on the clustering columns.

·      During queries, data pruning happens, only relevant blocks are read and make queries faster and cheaper.

 

How is Clustering Different from Partitioning?

Feature

Partitioning

Clustering

Purpose

Divides table into parts (partitions)

Organizes rows inside the table/partition

Based on

Date, Timestamp, Integer column

Up to 4 columns of any data type

Query Filtering

Reduces data scanned if partition filter used

Reduces data scanned via sorting and pruning

Limits

1 partition column only and, maximum 4000 partitions for a table

Max 4 clustering columns

 

When to use

For large, time-based data

For frequent filtering on a few columns

 

Limitations of Clustering

·      Maximum 4 clustering columns per table.

·      Cannot recluster existing data manually (BigQuery reclusters data automatically over time).

·      Clustering effectiveness reduces if the data is frequently updated or unordered on those fields.

·      Not ideal if queries don’t use clustering columns in WHERE clause.                

                 

                 

Creating a Clustered Table in BigQuery

CREATE TABLE `my_project.my_dataset.customers_clustered`
(
  customer_id INT64,
  first_name STRING,
  last_name STRING,
  city STRING,
  state STRING,
  email STRING,
  signup_date DATE
)
CLUSTER BY city, state;

In this table, clustering is done on city and state, assuming queries often filter by these.

 

Insert Sample Records

 

INSERT INTO `my_project.my_dataset.customers_clustered` (
  customer_id, first_name, last_name, city, state, email, signup_date
) VALUES
(1, 'Aarav', 'Sharma', 'Mumbai', 'Maharashtra', 'aarav.sharma@example.com', '2024-01-10'),
(2, 'Vivaan', 'Mehta', 'Pune', 'Maharashtra', 'vivaan.mehta@example.com', '2024-02-15'),
(3, 'Diya', 'Patel', 'Ahmedabad', 'Gujarat', 'diya.patel@example.com', '2024-03-01'),
(4, 'Ishaan', 'Reddy', 'Hyderabad', 'Telangana', 'ishaan.reddy@example.com', '2024-04-05'),
(5, 'Anaya', 'Nair', 'Kochi', 'Kerala', 'anaya.nair@example.com', '2024-05-20'),
(6, 'Kabir', 'Singh', 'Chandigarh', 'Chandigarh', 'kabir.singh@example.com', '2024-06-18'),
(7, 'Meera', 'Das', 'Bhubaneswar', 'Odisha', 'meera.das@example.com', '2024-07-09'),
(8, 'Reyansh', 'Mishra', 'Lucknow', 'Uttar Pradesh', 'reyansh.mishra@example.com', '2024-08-22'),
(9, 'Saanvi', 'Tripathi', 'Varanasi', 'Uttar Pradesh', 'saanvi.tripathi@example.com', '2024-09-30'),
(10, 'Advait', 'Chatterjee', 'Kolkata', 'West Bengal', 'advait.chatterjee@example.com', '2024-10-10'),
(11, 'Myra', 'Roy', 'Kolkata', 'West Bengal', 'myra.roy@example.com', '2024-11-11'),
(12, 'Arjun', 'Rao', 'Bangalore', 'Karnataka', 'arjun.rao@example.com', '2024-12-01');

How Clustering Helps in Query Performance?

Example Query:

SELECT first_name, last_name, email
FROM `my_project.my_dataset.customers_clustered`
WHERE city = 'Kolkata' AND state = 'West Bengal';

Since clustering is done on city and state, BigQuery scans fewer blocks, making the query cheaper and faster.

 


How can I get clustering information for a specific table?

Run the following query to get clustering information for a specific table. 

SELECT 
*
FROM 
  `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE 
  clustering_ordinal_position IS NOT NULL
  AND table_name = 'your_table_name';

For customers_clustered table, I received following information.

 


In Summary,

·      Use clustering when your queries frequently filter or group by specific columns.

·      Clustering works best for read-heavy tables with append-only workloads.

·      Unlike partitioning, clustering doesn't restrict you to date-based columns.

·      Combine with partitioning for maximum optimization on large datasets.

 

  

Previous                                                    Next                                                    Home

No comments:

Post a Comment