Tuesday, 3 June 2025

Efficient Schema Design in BigQuery

BigQuery is a fully managed, serverless data warehouse that performs best when the schema is designed for efficiency. Unlike traditional relational databases, which favor normalized schemas to eliminate redundancy, BigQuery often benefits from denormalized structures, which allow faster query execution by reducing the need for costly JOIN operations.

 

This guide will introduce normalization vs. denormalization with simple examples, explain why denormalization is recommended in BigQuery, and discuss the benefits of nested and repeated columns.

 

1. Normalization vs. Denormalization

Before understanding why BigQuery prefers denormalization, let's first clarify what normalization and denormalization mean.

 

Example of a Normalized Schema (Relational Database Approach)

Consider an e-commerce dataset with information about orders and customers. A normalized schema would split the data into separate tables to eliminate redundancy:

 

Customers Table

customer_id | name    | email
----------- | ------- | ------------------
1           | Ram     | ram@email.com
2           | Krishna | krishna@email.com

 

Orders Table

order_id | customer_id | product | quantity
-------- | ----------- | ------- | --------
101      | 1           | Laptop  | 1
102      | 2           | Phone   | 2
103      | 1           | Mouse   | 1

 

To get all orders with customer details, we need to perform a JOIN between the Customers and Orders tables.

SELECT o.order_id, c.name, c.email, o.product, o.quantity
FROM Orders o
JOIN Customers c
ON o.customer_id = c.customer_id;

This approach is optimized for data consistency but can slow down queries due to the JOIN operation.

 

Example of a Denormalized Schema (BigQuery Approach)

BigQuery performs best when data is denormalized, meaning all necessary information is stored in a single table.

 

order_id | customer_id | name    | email             | product | quantity
-------- | ----------- | ------- | ----------------- | ------- | --------
101      | 1           | Ram     | ram@email.com     | Laptop  | 1
102      | 2           | Krishna | krishna@email.com | Phone   | 2
103      | 1           | Ram     | ram@email.com     | Mouse   | 1

 

Since there are no joins, queries are much faster. However, this can lead to data duplication, which increases storage costs. BigQuery optimizes for performance, so the trade-off is often worth it.
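With the denormalized layout, the earlier JOIN query collapses into a single-table scan. A minimal sketch, assuming the denormalized table above is named orders_denormalized (the article does not name it):

```sql
-- Same result as the JOIN query above, but read from one table.
-- Table name orders_denormalized is an assumption for illustration.
SELECT order_id, name, email, product, quantity
FROM orders_denormalized;
```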

 

2. Why BigQuery Recommends Denormalization

2.1 Distributed Processing: BigQuery is designed to scan large datasets efficiently in a distributed manner. Avoiding JOIN operations helps distribute queries more effectively.

 

2.2 Better Query Performance: Query execution is significantly faster because there's no need to fetch and combine data from multiple tables.

 

2.3 Reduced Query Complexity: Simplifies SQL queries since all relevant information is in one place.

 

3. Queries Like ORDER BY and DISTINCT Require Shuffling

However, some queries like ORDER BY or DISTINCT may still require shuffling (data redistribution across processing nodes), which can impact performance.

 

Consider the following denormalized table named sales_data

 

order_id | customer_name | product | price | order_date
-------- | ------------- | ------- | ----- | -------------------
101      | Ram           | Laptop  | 1200  | 2024-03-01 10:00:00
102      | Krishna       | Phone   | 800   | 2024-03-02 11:30:00
103      | Ram           | Mouse   | 40    | 2024-03-01 12:45:00

 

If we run:

SELECT order_id, customer_name, product, price, order_date
FROM sales_data
ORDER BY order_date;

 

Why Does This Cause Shuffling?

·      BigQuery distributes data across multiple nodes for parallel processing.

·      Since ordering requires all data to be sorted together, BigQuery redistributes data across nodes before applying the ORDER BY.

·      This increases query execution time, especially for large datasets.

 

3.1 DISTINCT Causing Shuffling

Suppose we want a list of unique customers:

SELECT DISTINCT customer_name FROM sales_data;

 

Why Does This Cause Shuffling?

·      Different nodes may process different parts of the dataset.

·      Since the same customer name might appear in multiple partitions, BigQuery needs to move data across nodes to ensure uniqueness.

·      This adds network overhead and increases query time.

 

3.2 How to Reduce Shuffling

3.2.1 Use Partitioning & Clustering

If the table is partitioned by order_date, queries with ORDER BY order_date can process only relevant partitions, reducing shuffle.

 

Clustering by customer_name helps DISTINCT queries since duplicate names are stored closer together within the partition.

CREATE TABLE sales_data (
    order_id INT64,
    customer_name STRING,
    product STRING,
    price FLOAT64,
    order_date TIMESTAMP
)
PARTITION BY DATE(order_date)
CLUSTER BY customer_name;

 

Partitioning divides a table into segments (partitions) based on a column, so queries only scan relevant partitions instead of the entire table.

 

Clustering organizes data within each partition based on one or more columns, storing similar values together.
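As a sketch of partition pruning in action, a query that filters on the partitioning column only scans the matching partitions (assuming the partitioned sales_data table defined above):

```sql
-- Only the 2024-03-01 partition is scanned, not the whole table,
-- so the subsequent sort touches far less data.
SELECT order_id, customer_name, product, price, order_date
FROM sales_data
WHERE DATE(order_date) = '2024-03-01'
ORDER BY order_date;
```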

 

3.3 Avoid Sorting Unless Necessary

If ORDER BY is needed, limit the number of rows:

SELECT order_id, customer_name, product, price, order_date
FROM sales_data
ORDER BY order_date
LIMIT 100;

3.4 Use Approximate Aggregation for Large DISTINCT Queries

Instead of DISTINCT, use APPROX_COUNT_DISTINCT():

SELECT APPROX_COUNT_DISTINCT(customer_name) FROM sales_data;

4. Using Nested and Repeated Columns Instead of Joins

Instead of fully denormalizing with duplicate data, BigQuery allows nested and repeated columns to maintain relationships efficiently.

 

Example: Using Nested & Repeated Fields

Instead of storing customer orders in a separate table, we can store an array of orders within each customer record.

 

{
  "customer_id": 1,
  "name": "Ram",
  "email": "ram@email.com",
  "orders": [
    {"order_id": 101, "product": "Laptop", "quantity": 1},
    {"order_id": 103, "product": "Mouse", "quantity": 1}
  ]
}
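In BigQuery DDL, this nested layout can be declared with an ARRAY of STRUCTs. A hedged sketch of the equivalent table definition:

```sql
CREATE TABLE customer_orders (
    customer_id INT64,
    name        STRING,
    email       STRING,
    -- REPEATED STRUCT: an array of order records stored with each customer
    orders      ARRAY<STRUCT<order_id INT64, product STRING, quantity INT64>>
);
```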

 

In BigQuery Table Format:

customer_id | name    | email             | orders (REPEATED STRUCT)
----------- | ------- | ----------------- | ------------------------
1           | Ram     | ram@email.com     | [{"order_id":101, "product":"Laptop", "quantity":1}, {"order_id":103, "product":"Mouse", "quantity":1}]
2           | Krishna | krishna@email.com | [{"order_id":102, "product":"Phone", "quantity":2}]

 

Let’s create a JSON document for the above example, upload it to BigQuery, and experiment.

 

customerOrders.json

 

{"id": 1, "name": "Ram", "email": "ram@email.com", "orders": [{"order_id": 101, "product": "Laptop", "quantity": 1}, {"order_id": 103, "product": "Mouse", "quantity": 1}]}
{"id": 2, "name": "Krishna", "email": "krishna@email.com", "orders": [{"order_id": 102, "product": "Phone", "quantity": 2}]}

BigQuery expects newline-delimited JSON (NDJSON), meaning each JSON object should be on a separate line (not wrapped in an array []).
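To sanity-check the file format before uploading, a small Python sketch can serialize records as NDJSON and verify that every line parses on its own (the records mirror customerOrders.json above):

```python
import json

# Records mirroring customerOrders.json from this article.
records = [
    {"id": 1, "name": "Ram", "email": "ram@email.com",
     "orders": [{"order_id": 101, "product": "Laptop", "quantity": 1},
                {"order_id": 103, "product": "Mouse", "quantity": 1}]},
    {"id": 2, "name": "Krishna", "email": "krishna@email.com",
     "orders": [{"order_id": 102, "product": "Phone", "quantity": 2}]},
]

def to_ndjson(rows):
    # One JSON object per line, with no enclosing [] array --
    # the newline-delimited format BigQuery load jobs expect.
    return "\n".join(json.dumps(row) for row in rows) + "\n"

ndjson = to_ndjson(records)

# Each line must parse independently as a standalone JSON object.
parsed = [json.loads(line) for line in ndjson.strip().split("\n")]
```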

 

 

 

Follow the step-by-step procedure below to upload the customerOrders.json file.

 

Step 1: Open BigQuery Explorer

·      Open the Google Cloud Console.

·      Navigate to BigQuery by selecting BigQuery from the menu or using the search bar.

·      In the BigQuery Explorer, locate your project and dataset.

 

Step 2: Click on the Dataset

In the BigQuery Explorer, expand your project and click on the dataset where you want to upload the customerOrders.json file. You will be taken to the Dataset Information page.


 

 

Step 3: Click on 'CREATE TABLE'

On the dataset information page, click on the Create Table button to start the process of creating a new table.

 

Step 4: Fill in the Table Creation Form

Once you click Create Table, a form will open where you can specify details about the table you're creating.

 

Step 5: Choose 'Upload' as the Source

Under Source, select the Upload option.

 


Click on the Browse button and select your customerOrders.json file from your local system.

 

Step 6: Provide the Table Name

·      In the Table Name field, enter the name of the table, for example: customer_orders.

·      This will be the name of the table in BigQuery that will hold the uploaded data.

 

Step 7: Under the Schema section, click on the ‘ADD FIELD’ button to define the schema explicitly.

 


Click on the ‘ADD FIELD’ button and define the top-level fields:

id: INTEGER

name: STRING

email: STRING


 

Let’s define orders as a RECORD. As this is a list of orders associated with a customer, we need to set the Mode to REPEATED. See the screenshot below for more details.


 

Now click on the ADD FIELD button below orders to define the fields of the orders record.

 

The final schema looks like below.


 

Click on CREATE TABLE button to proceed with table creation.
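As an alternative to the console UI, the same table can be created and loaded with the bq command-line tool. A sketch, assuming the dataset and file names used in this article (adjust to yours):

```shell
# schema.json describes the nested schema; orders is a REPEATED RECORD.
cat > schema.json <<'EOF'
[
  {"name": "id", "type": "INTEGER"},
  {"name": "name", "type": "STRING"},
  {"name": "email", "type": "STRING"},
  {"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "order_id", "type": "INTEGER"},
    {"name": "product", "type": "STRING"},
    {"name": "quantity", "type": "INTEGER"}
  ]}
]
EOF

# Load the NDJSON file, creating the table from the schema file.
bq load --source_format=NEWLINE_DELIMITED_JSON \
    test_dataset.customer_orders customerOrders.json schema.json
```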

 

 

Upon successful table creation, you can see a preview like below.

 


Schema is given below.

 


Step 8: Querying table with nested data (RECORD type)

To extract the details, use the UNNEST() function.

SELECT c.name, o.order_id, o.product, o.quantity
FROM `i-mariner-453509-e9.test_dataset.customer_orders` c, UNNEST(c.orders) o;

 

·      Project: i-mariner-453509-e9

·      dataset: test_dataset

·      table: customer_orders

 

You can see the response like below.

 


Understanding UNNEST function

The UNNEST() function in BigQuery expands (flattens) arrays stored inside a column so that each element becomes its own row.

 

Before unnesting, the data looks like below:

 

customer_id | name    | email             | orders (REPEATED STRUCT)
----------- | ------- | ----------------- | ------------------------
1           | Ram     | ram@email.com     | [{"order_id":101, "product":"Laptop", "quantity":1}, {"order_id":103, "product":"Mouse", "quantity":1}]
2           | Krishna | krishna@email.com | [{"order_id":102, "product":"Phone", "quantity":2}]

 

After unnesting (flattened):

 

id | name    | email             | order_id | product | quantity
-- | ------- | ----------------- | -------- | ------- | --------
1  | Ram     | ram@email.com     | 101      | Laptop  | 1
1  | Ram     | ram@email.com     | 103      | Mouse   | 1
2  | Krishna | krishna@email.com | 102      | Phone   | 2

 

UNNEST(c.orders) splits the array in the orders column into separate rows. Now each order has its own row, making it easier to query.
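One caveat worth noting: the comma (CROSS JOIN) form drops customers whose orders array is empty. A hedged sketch using LEFT JOIN UNNEST, which keeps such customers with NULL order columns:

```sql
-- Keeps customers even when their orders array is empty.
SELECT c.name, o.order_id, o.product, o.quantity
FROM `i-mariner-453509-e9.test_dataset.customer_orders` c
LEFT JOIN UNNEST(c.orders) AS o;
```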

 

In summary,

·      Prefer denormalization for better query performance, but be mindful of storage trade-offs.

·      Use nested and repeated fields to maintain relationships efficiently without expensive joins.

·      Optimize ORDER BY and DISTINCT operations as they may require data shuffling.

·      Partition and cluster tables to improve performance further.

 
