Monday, 9 June 2025

Partitioning by TIMESTAMP Column in BigQuery

BigQuery only supports partitioning by DATE, INTEGER, or RANGE, but not directly by a TIMESTAMP column. If you want to create a table partition by a TIMESTAMP column, you need to cast it to DATE explicitly.

CREATE TABLE my_project.my_dataset.transactions_partitioned (
    transaction_id STRING,
    customer_name STRING,
    transaction_time TIMESTAMP,
    transaction_amount FLOAT64
)
PARTITION BY DATE(transaction_time);

 

·      BigQuery partitions TIMESTAMP values at the daily level (not hourly or minute-wise). Explicitly converting TIMESTAMP to DATE ensures that all rows from the same UTC day go into the same partition.

 

·      If we don't convert, we get an error because partitioning by TIMESTAMP is not directly supported.

 

You can confirm the same by going to the table details section.

 


Let’s insert some data into the table by executing following DML statement.  

INSERT INTO my_project.my_dataset.transactions_partitioned 
(transaction_id, customer_name, transaction_time, transaction_amount)
VALUES
    ('T1001', 'Aarav', TIMESTAMP '2024-03-15 10:30:45', 250.75),  -- Valid
    ('T1002', 'Bhavya', NULL, 120.00),                            -- NULL timestamp
    ('T1003', 'Charan', TIMESTAMP '9100-02-23 14:00:00', 310.50), -- Out-of-range
    ('T1004', 'Divya', TIMESTAMP '1990-05-10 08:15:30', 500.25),  -- Old but valid
    ('T1005', 'Esha', TIMESTAMP '2025-07-21 12:45:10', 600.40),   -- Future date
    ('T1006', 'Farhan', TIMESTAMP '2023-12-31 23:59:59', 220.00), -- Year-end transaction
    ('T1007', 'Gaurav', TIMESTAMP '2020-02-29 01:10:05', 430.10), -- Leap year date
    ('T1008', 'Hina', TIMESTAMP '2022-06-10 07:55:22', 350.25),   -- Valid
    ('T1009', 'Ishaan', TIMESTAMP '2000-01-01 00:00:01', 150.50), -- Old date
    ('T1010', 'Jiya', TIMESTAMP '9999-12-31 23:59:59', 890.75),   -- Extreme future date
    ('T1011', 'Kunal', NULL, 290.00),                             -- NULL timestamp
    ('T1012', 'Lavanya', TIMESTAMP '2030-08-15 06:30:00', 120.30),-- Future date
    ('T1013', 'Manoj', TIMESTAMP '2015-04-01 19:20:40', 400.00),  -- Old but recent
    ('T1014', 'Neha', TIMESTAMP '2026-11-05 09:45:10', 340.60),   -- Future date
    ('T1015', 'Omkar', TIMESTAMP '1750-12-10 03:33:33', 510.90),  -- Invalid (too old)
    ('T1016', 'Pooja', TIMESTAMP '2077-01-01 11:11:11', 600.10),  -- Future
    ('T1017', 'Qasim', TIMESTAMP '2024-04-20 04:55:00', 120.75),  -- Valid
    ('T1018', 'Riya', TIMESTAMP '2011-07-07 13:33:33', 250.00),   -- Old date
    ('T1019', 'Sohan', TIMESTAMP '9000-05-05 05:05:05', 999.99),  -- Out-of-range
    ('T1020', 'Tanvi', TIMESTAMP '2027-09-09 18:18:18', 875.25),  -- Future date
    ('T1021', 'Umesh', TIMESTAMP '2024-01-26 20:26:26', 450.60),  -- Republic Day India
    ('T1022', 'Vanya', TIMESTAMP '1995-08-24 22:22:22', 320.10),  -- Old date
    ('T1023', 'Wasim', TIMESTAMP '2050-10-31 23:59:59', 710.55),  -- Future Halloween
    ('T1024', 'Xavier', TIMESTAMP '1800-06-15 14:14:14', 230.20), -- Very old date
    ('T1025', 'Yash', NULL, 540.00),                              -- NULL timestamp
    ('T1026', 'Zoya', TIMESTAMP '2024-02-14 12:00:00', 660.90);   -- Valentine's Day

Let’s query ‘INFORMATION_SCHEMA.PARTITIONS’ for the table 'sales_partitioned' to confirm partitions.

SELECT 
    partition_id,
    total_rows
FROM my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'transactions_partitioned';

 

 


Refer my previous post to know about __UNPARTITIONED__, __NULL__ (these are special partitions created by BigQuery). 

 

Optimizing BigQuery Performance with Integer Partitioning

Partitioning is a crucial optimization technique in BigQuery that helps to improve query performance and reduce costs. In this blog post, we will explore partitioning a table by an integer column, understand its benefits, and learn how to implement it with DDL and DML examples.

 

Why Use Integer Partitioning in BigQuery?

BigQuery allows partitioning by an INTEGER column, which is useful when data is organized based on numeric values like years, customer IDs, or categories. This method:

 

·      Improves query performance by scanning only relevant partitions.

·      Reduces query costs as fewer data blocks are read.

·      Speeds up data retrieval when filtering by the partitioned column.

 

Creating a Partitioned Table (DDL Example)

Let's create a table named customer_orders, partitioned by an integer column order_year.

 

CREATE OR REPLACE TABLE my_project.my_dataset.customer_orders (
    order_id INT64,
    customer_name STRING,
    product STRING,
    quantity INT64,
    order_year INT64
)
PARTITION BY RANGE_BUCKET(order_year, GENERATE_ARRAY(2020, 2080, 15));

When you're partitioning a table by an integer column (like order_year), you need to specify:

 

·      Start – the first value in your range

·      End – the limit up to where you want to go

·      Interval – how wide each bucket should be (like step size)

 

PARTITION BY RANGE_BUCKET(order_year, GENERATE_ARRAY(2020, 2080, 20))

Here

·      Start = 2020

·      End = 2080

·      Interval = 15

 

This generates buckets like:

 

·      Bucket 0: years from 2020 to 2034

·      Bucket 1: years from 2035 to 2049

·      Bucket 2: years from 2050 to 2064

·      Bucket 3: years from 2064 to 2079

 

You can confirm the partition detail in the table details section like below.


Let’s insert some data into customer_orders table.

 

INSERT INTO my_project.my_dataset.customer_orders (order_id, customer_name, product, quantity, order_year)
VALUES
    (1, 'Aarav', 'Laptop', 1, 2023),
    (2, 'Priya', 'Mobile Phone', 2, 2023),
    (3, 'Vikram', 'Headphones', 1, 2033),
    (4, 'Ananya', 'Tablet', 1, 2024),
    (5, 'Rahul', 'Smart Watch', 3, 2034),
    (6, 'Neha', 'Laptop', 1, 2024),
    (7, 'Kabir', 'Camera', 2, 2022),
    (8, 'Sanya', 'Speakers', 1, 2042),
    (9, 'Rohan', 'Smart TV', 1, 2021),
    (10, 'Ishita', 'Laptop', 2, 2051),
    (11, 'Manoj', 'Gaming Console', 1, 2022),
    (12, 'Pooja', 'Tablet', 3, 2023),
    (13, 'Arjun', 'Wireless Earbuds', 1, 2043),
    (14, 'Simran', 'Mobile Phone', 1, 2024),
    (15, 'Amit', 'Smart Watch', 2, 2062),
    (16, 'Nisha', 'Laptop', 1, 2021),
    (17, 'Ravi', 'Headphones', 2, 2021),
    (18, 'Meera', 'Smart TV', 1, 2023),
    (19, 'Dev', 'Speakers', 3, 2084),
    (20, 'Tanya', 'Gaming Console', 1, 2083),
    (21, 'Ram', 'Laptop', 1, 2076);

Let’s see the partitions allocation.

SELECT 
    partition_id,
    total_rows
FROM my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'customer_orders';

Above query generate following result.

 


 

How the Data is Partitioned?

2020 partition (covers years 2020-2034):

·      Contains 14 rows (orders from 2021-2034)

·      Includes order years: 2021, 2022, 2023, 2024, 2033, 2034

 

2035 partition (covers years 2035-2049):

·      Contains 2 rows (orders from 2042, 2043)

 

2050 partition (covers years 2050-2064):

·      Contains 2 rows (orders from 2051, 2062)

 

2065 partition (covers years 2065-2079):

·      Contains 1 row (order from 2076)

 

__UNPARTITIONED__:

·      Contains 2 rows that fall outside your defined range (orders from 2083, 2084). These years are above your upper boundary of 2080

 

The two orders with years 2083 and 2084 are placed in the __UNPARTITIONED__ partition because:

 

·      Your range bucket goes up to 2080 (with the last bucket being 2065-2080)

·      Any years above 2080 don't fit into any defined bucket

 

Previous                                                    Next                                                    Home

No comments:

Post a Comment