Saturday, 28 June 2025

Best Practices in BigQuery to save cost and improve performance

 

When working with BigQuery, performance and cost optimization go hand in hand. Understanding how queries are executed under the hood allows you to design more efficient, cost-effective data workflows.

 

This post explains key best practices to help you write better queries, minimize costs, and make the most out of BigQuery’s powerful architecture.

 

1. Techniques to Reduce Data Scanned

Reducing the data read directly reduces query costs. Here's how:

 

1.1 Avoid SELECT * statements

BigQuery charges based on the number of bytes processed — and every column you read costs money. That means the more data you scan, the more you pay. Even if you don’t need all the columns, using SELECT * forces BigQuery to scan everything.

 

How Does Columnar Storage Work?

BigQuery is built on columnar storage, which means data is stored by column, not by row.

 

If you query only 2 columns from a 50-column table, BigQuery reads only those 2 columns from disk. This is extremely efficient compared to row-based storage systems (like traditional relational databases), where reading any part of a row requires reading the whole row.

 

So, the fewer columns you read, the less data BigQuery needs to scan, and the less you pay.

 

Let’s say you have:

 

·      Table schema: 50 columns

·      Row size: 50 KB

·      Row count: 10,000 rows

 

Case 1: SELECT *

·      Each row is 50 KB

·      10,000 rows × 50 KB = 500 MB scanned

 

Case 2: SELECT column1, column2

·      Assume column1 and column2 together total just 2 KB per row

·      10,000 rows × 2 KB = 20,000 KB = 20 MB scanned

 

Avoid SELECT * in production queries. Always select only the columns you need.
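
A minimal sketch of the two cases above, assuming a hypothetical 50-column table my_dataset.wide_table:

-- Case 1: scans every column of every row (about 500 MB in the example above)
SELECT *
FROM my_dataset.wide_table;

-- Case 2: scans only the two referenced columns (about 20 MB)
SELECT column1, column2
FROM my_dataset.wide_table;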

 

1.2 Use partitioning and clustering wherever possible.

BigQuery is a serverless, columnar data warehouse that charges you based on how much data you scan, not how many rows you return. So, if you query a huge table without optimization, it could scan GBs or even TBs of data, costing more and running slower.

 

Partitioning breaks a table into segments (called partitions) based on the value of a column, most commonly a date column.

 

For example, let's say you have a user_logs table with millions of rows.

CREATE TABLE my_dataset.user_logs (
  user_id STRING,
  activity STRING,
  log_date DATE
)
PARTITION BY log_date;

Now, if you write this query:

SELECT * FROM my_dataset.user_logs
WHERE log_date = '2025-04-01';

BigQuery will only scan the partition for April 1, 2025, not the entire table. That’s much faster and cheaper.

 

Clustering sorts your table within each partition based on certain columns. It helps BigQuery to locate the exact rows faster.

 

For example, you can cluster by one or more columns:

CREATE TABLE my_dataset.user_logs (
  user_id STRING,
  activity STRING,
  log_date DATE
)
PARTITION BY log_date
CLUSTER BY user_id;

 

Get all the logs for a given user for April 1, 2025:

SELECT * FROM my_dataset.user_logs
WHERE log_date = '2025-04-01' AND user_id = 'user_123';

Now, queries like the one above are very efficient because BigQuery reads only:

 

·      The partition for the date.

·      The rows clustered around user_123.
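
To guard against accidental full-table scans, you can also require every query to filter on the partition column. A sketch using the same user_logs table; the require_partition_filter option is optional and shown only as an illustration of the setting:

CREATE TABLE my_dataset.user_logs (
  user_id STRING,
  activity STRING,
  log_date DATE
)
PARTITION BY log_date
CLUSTER BY user_id
-- reject any query that does not filter on log_date
OPTIONS (require_partition_filter = TRUE);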

 

1.3 Denormalize your schema by combining related data into a single table and use nested and repeated fields. BigQuery delivers optimal performance when working with denormalized data structures.

In traditional relational databases (like MySQL or PostgreSQL), you’re taught to normalize data, split it into multiple related tables to reduce redundancy.

 

But in BigQuery, the guidance is completely different:

 

·      Joins are expensive.

·      Reading lots of tables = scanning lots of data.

·      Nested data is first-class in BigQuery, so it's better to keep related data together.

 

Denormalization means that, instead of splitting data into separate tables with foreign keys, you combine it into a single table with nested or repeated fields.

 

BigQuery supports STRUCTs (nested objects) and ARRAYs (repeated objects) inside a row. That’s powerful, because now a single row can contain complex, related data.

 

Example: Normalized vs Denormalized

a. Traditional (Normalized)

 

customers table

customer_id | name
1           | Ram
2           | Krishna

 

orders table

order_id | customer_id | amount
101      | 1           | 50
102      | 1           | 30
103      | 2           | 70

 

To get all orders for each customer, you’d need a JOIN:

SELECT c.name, o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

 

b. BigQuery (Denormalized with Nested Fields)

You’d instead have one table like this:

{
  "customer_id": 1,
  "name": "Ram",
  "orders": [
    {"order_id": 101, "amount": 50},
    {"order_id": 102, "amount": 30}
  ]
}

 

Schema in BigQuery:

CREATE TABLE my_dataset.customers (
  customer_id INT64,
  name STRING,
  orders ARRAY<STRUCT<order_id INT64, amount FLOAT64>>
);

Now, you can query like this:

SELECT name, o.amount
FROM my_dataset.customers, UNNEST(orders) AS o;

 

The cost of JOINs in BigQuery

Let’s say you have:

 

·      A customers table: 5 GB

·      A customer_orders table: 100 GB

 

When you run the following query:

SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers c
JOIN customer_orders o ON c.customer_id = o.customer_id;

What happens under the hood?

·      Data Shuffling: BigQuery must bring all rows with the same customer_id together across both tables. This requires massive data movement across the network, called a shuffle phase.

 

·      Memory Usage: Each node must buffer part of the join keys in memory, especially from the smaller table (5GB customers). If the customers table doesn’t fit well in memory on all join nodes, it could spill to disk, slowing down processing.

 

·      Cross-Worker Communication: Since different partitions of each table are on different workers, the join causes cross-node communication, which is expensive and time-consuming.

 

Instead of joining:

·      You could denormalize customer_orders by embedding orders info as nested fields.

·      Querying becomes local to each record, with no data movement or shuffle.

 

In distributed systems like BigQuery, joins cause data movement. The more data you shuffle, the more time and cost your query will take.
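
As a rough sketch of the difference, a per-customer aggregate over the nested orders column (from the denormalized customers table defined above) needs no JOIN and no shuffle at all:

-- Sum each customer's embedded orders; the work stays local to each row,
-- so there is no shuffle phase.
SELECT
  name,
  (SELECT SUM(o.amount) FROM UNNEST(orders) AS o) AS total_spent
FROM my_dataset.customers;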

 

1.4 Create materialized views to store aggregated data. Querying these views reduces the amount of data scanned, as the data is already pre-aggregated.

A materialized view is a precomputed and stored result of a query, especially useful for aggregations. Unlike a standard (logical) view that runs the query every time you query it, a materialized view stores the results and keeps them up to date automatically.

 

Suppose you have a sales table:

date       | region | product | amount
2024-01-01 | West   | A       | 100
2024-01-01 | West   | B       | 200

 

Suppose the table is 7 TB and you frequently run this aggregation:

SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY region, product;

 

Every time this runs, BigQuery scans the full 7 TB.

 

Instead, let's create a materialized view:

 

CREATE MATERIALIZED VIEW project.dataset.mv_sales_summary AS
SELECT region, product, SUM(amount) AS total_sales
FROM project.dataset.sales
GROUP BY region, product;

 

Then just:

SELECT *
FROM project.dataset.mv_sales_summary
WHERE region = 'West';

Now:

 

·      Querying this view scans only the small pre-aggregated result.

·      BigQuery auto-refreshes it as new data comes in.

·      You save both time and cost.

 

Another big advantage: instead of recomputing the entire view, BigQuery applies only the changes (deltas), which makes updates faster and more efficient. This maintains performance and reduces cost compared to fully reprocessing the data.
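
If you want control over how the view is kept fresh, the refresh behaviour can be tuned when the view is created. A sketch with optional settings (the one-hour interval is only an example):

CREATE MATERIALIZED VIEW project.dataset.mv_sales_summary
OPTIONS (
  enable_refresh = TRUE,           -- let BigQuery refresh the view automatically
  refresh_interval_minutes = 60    -- refresh at most once per hour
)
AS
SELECT region, product, SUM(amount) AS total_sales
FROM project.dataset.sales
GROUP BY region, product;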

 

1.5 Avoid using external data sources when performance is a high priority.

BigQuery allows querying external data sources like:

 

·      Google Sheets

·      Cloud Storage files (CSV, JSON, Parquet, etc.)

·      Bigtable

·      Cloud SQL, etc.

 

These are convenient for quick access or ad-hoc analysis, but they don’t offer the same performance as native BigQuery tables.

 

Why are external sources slower?

·      No native storage: Data is not stored inside BigQuery; it's fetched from the external source each time you run a query.

·      No columnar storage optimizations: External sources don’t benefit from BigQuery’s fast, column-based storage engine.

·      No partitioning or clustering: You can’t partition or cluster external sources for optimized filtering and scanning.

·      Limited caching: BigQuery can’t cache external data efficiently, so every query can be as slow as the first.

·      Extra latency: Reading from remote files or APIs introduces I/O bottlenecks.

 

In summary, external data sources are great for flexibility and quick access but come with performance trade-offs. If speed and efficiency are critical, it's best to ingest the data into native BigQuery tables.
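
For example, a one-off (or scheduled) ingestion from Cloud Storage into a native table can be done with a LOAD DATA statement; the bucket path and table name below are hypothetical:

-- Load Parquet files from Cloud Storage into a native BigQuery table
LOAD DATA INTO my_dataset.user_logs
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/user_logs/*.parquet']
);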

 

1.6 Reduce Shuffling Size Before JOINs

In BigQuery, JOINs are shuffle-intensive operations.

 

That means:

·      The data from the joined tables needs to be repartitioned (shuffled) across many nodes based on the join key.

·      This shuffle step is expensive, in terms of time, memory, and cost.

 

Let’s take an example

Large JOIN with No Filtering

This query will scan the entire orders table, even if most rows are not needed.

SELECT
  c.id AS customer_id,
  c.name AS customer_name,
  o.order_id,
  o.order_date,
  o.total_amount
FROM customers c
JOIN orders o
  ON c.id = o.customer_id;

 

If orders has 100 million rows, BigQuery must shuffle all of them to do the join. Even if you're only interested in recent orders, you're paying for everything.

 

Filter Before Joining

Here, only recent orders (after 2024-01-01) are considered before the join.

WITH filtered_orders AS (
  SELECT *
  FROM orders
  WHERE order_date >= '2024-01-01'
)
SELECT
  c.id AS customer_id,
  c.name AS customer_name,
  fo.order_id,
  fo.order_date,
  fo.total_amount
FROM customers c
JOIN filtered_orders fo
  ON c.id = fo.customer_id;

 

In this example, only the relevant (smaller) portion of orders is shuffled: faster execution, less memory used, and lower cost.
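
You can shrink the shuffle even further by projecting only the columns the join actually needs inside the CTE instead of SELECT *. A small variation on the query above:

WITH filtered_orders AS (
  -- keep only the rows and columns the join and final output need
  SELECT customer_id, order_id, order_date, total_amount
  FROM orders
  WHERE order_date >= '2024-01-01'
)
SELECT
  c.id AS customer_id,
  c.name AS customer_name,
  fo.order_id,
  fo.order_date,
  fo.total_amount
FROM customers c
JOIN filtered_orders fo
  ON c.id = fo.customer_id;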

 

2. Techniques to Reduce CPU Time in BigQuery

BigQuery charges based on the data it stores and the data it processes, and complex queries can:

 

·      Use more CPU and memory

·      Take longer to run

·      Cost more money especially at scale

 

So, optimizing queries helps to save both time and cost.

 

2.1 Use Approximate functions whenever possible

BigQuery provides approximate aggregate functions that sacrifice a little accuracy for much better performance.

 

How Do They Work?

Instead of scanning all rows to calculate an exact result, approximate functions use probabilistic algorithms that:

 

·      Sample the data smartly

·      Provide a statistically accurate estimate

·      Run faster with lower resource usage

 

Example: APPROX_COUNT_DISTINCT()

For example, suppose you want to find out how many unique users visited your site.

SELECT COUNT(DISTINCT user_id) FROM website_visits;

 

This is accurate but expensive because it needs to:

 

·      Track every unique user_id

·      Store them in memory

·      Deduplicate them

 

Approximate version:

SELECT APPROX_COUNT_DISTINCT(user_id) FROM website_visits;

 

·      Much faster

·      Less expensive

·      May differ slightly from the exact count, but very close

 

For example, instead of 1,000,000 it might say 999,850.

 

Other Useful Approximate Functions in BigQuery

Function                 | What it does
APPROX_COUNT_DISTINCT()  | Counts the approximate number of unique values
APPROX_QUANTILES()       | Returns approximate quantiles (e.g. the median)
APPROX_TOP_COUNT()       | Finds the top values by frequency (approximate count)
APPROX_TOP_SUM()         | Finds the top values by the sum of another column
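
A couple of quick sketches of these functions in use, assuming hypothetical session_seconds and page columns on the same website_visits table:

-- Approximate median and 95th percentile of session duration
SELECT
  APPROX_QUANTILES(session_seconds, 100)[OFFSET(50)] AS approx_median,
  APPROX_QUANTILES(session_seconds, 100)[OFFSET(95)] AS approx_p95
FROM website_visits;

-- Top 5 pages by visit count (approximate)
SELECT APPROX_TOP_COUNT(page, 5) AS top_pages
FROM website_visits;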

 

When Should You Use Them?

Use Case            | Approx OK? | Why?
Dashboards          | OK         | Fast, close enough for charts
Reports with trends | OK         | Trends don't need exact counts
Financial audits    | Not OK     | Accuracy is critical
Billing or pricing  | Not OK     | Must be exact

 

 

In summary, if your query is slow or expensive and you don't need 100% exact numbers, start replacing COUNT(DISTINCT) with APPROX_COUNT_DISTINCT() and see the difference in speed.

 

2.2 Apply ORDER BY and other complex operations in the outermost query, as the data has usually been filtered by then. Additionally, whenever possible, combine ORDER BY with LIMIT to improve performance.    

        

When you write SQL in BigQuery:

 

·      Your query might have nested subqueries, joins, filters, etc.

·      Each step processes some portion of the data.

·      The deeper (earlier) the operation, the more raw data it touches.

·      So, complex operations like sorting (ORDER BY) are cheaper when applied after filtering, not before.

        

Let’s say we want the top 10 most active users from India.

 

Bad Approach (ORDER BY before filtering)

SELECT *
FROM (
  SELECT user_id, COUNT(*) AS visit_count
  FROM `project.dataset.website_visits`
  GROUP BY user_id
  ORDER BY visit_count DESC
)
WHERE user_id IN (
  SELECT user_id
  FROM `project.dataset.user_profiles`
  WHERE country = 'India'
)
LIMIT 10;

What's Wrong Here?

·      You are sorting all users (even non-Indian ones) before filtering.

·      BigQuery processes the entire dataset — costly!

·      Wasted effort on data you’ll throw away later.

Better Approach (Filter first, then ORDER BY):

SELECT v.user_id, COUNT(*) AS visit_count
FROM `project.dataset.website_visits` v
JOIN `project.dataset.user_profiles` u
ON v.user_id = u.user_id
WHERE u.country = 'India'
GROUP BY v.user_id
ORDER BY visit_count DESC
LIMIT 10;

 

Why Is This Better?

·      Filters out non-Indian users first

·      Groups and counts visits only for relevant data

·      Orders only the already filtered small result set

·      Uses LIMIT 10 to stop early

·      BigQuery can optimize it with less CPU, less memory, less cost

 

2.3 Store the transformed data in a separate table and perform any additional operations on this new table

Instead of doing everything in one long query with many joins, filters, calculations, and aggregations:

 

·      First, write intermediate or transformed results to a new table.

·      Then, run further queries (like sorting, joining, or aggregating) on that smaller, processed table.

·      This is often referred to as staging or using intermediate tables.
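
A minimal sketch of this pattern, using hypothetical table and column names:

-- Step 1: materialize the expensive transformation once
CREATE OR REPLACE TABLE my_dataset.daily_user_activity AS
SELECT
  user_id,
  DATE(event_timestamp) AS activity_date,
  COUNT(*) AS events
FROM my_dataset.raw_events
WHERE event_timestamp >= TIMESTAMP('2025-01-01')
GROUP BY user_id, activity_date;

-- Step 2: run further operations against the much smaller staging table
SELECT activity_date, COUNT(DISTINCT user_id) AS active_users
FROM my_dataset.daily_user_activity
GROUP BY activity_date
ORDER BY activity_date;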

 

 

2.4 Prefer SQL User-Defined Functions (UDFs) over JavaScript UDFs.

JavaScript UDFs require the creation of a separate runtime environment, which can slow down query performance due to the overhead of starting a subprocess. In contrast, SQL UDFs run natively within BigQuery's execution engine and are more efficient.

 

Let’s say you frequently need to extract the domain from an email address.

 

Example of JavaScript UDF

CREATE TEMP FUNCTION extract_domain_js(email STRING)
RETURNS STRING
LANGUAGE js AS """
  return email.split('@')[1];
""";

 

We can do the same using a native SQL UDF.

 

Create the SQL UDF:

CREATE TEMP FUNCTION extract_domain(email STRING)
RETURNS STRING
AS (
  SPLIT(email, '@')[OFFSET(1)]
);

 

Use it in a query:

SELECT email, extract_domain(email) AS domain
FROM UNNEST(['john@example.com', 'alice@google.com']) AS email;
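
If the function is reused across many queries, it can also be created as a persistent function in a dataset rather than a TEMP function; a sketch assuming a dataset named my_dataset:

CREATE OR REPLACE FUNCTION my_dataset.extract_domain(email STRING)
RETURNS STRING
AS (
  SPLIT(email, '@')[OFFSET(1)]
);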

 

 
