Monday, 12 May 2025

Core Components of BigQuery

BigQuery is built on Google’s highly scalable infrastructure and consists of several key components that enable fast, efficient, and cost-effective data analytics. Below are the core components of BigQuery:

1. Colossus: Distributed Storage

BigQuery uses Colossus, Google’s powerful storage system, to store and manage huge amounts of data efficiently. Each Google data center has its own Colossus cluster, which means BigQuery can quickly access and process data without slowdowns. Colossus also ensures that data is automatically copied and backed up, so even if a disk fails, there is no data loss or downtime.

 

Unlike traditional databases that store data in rows, BigQuery uses a column-based storage format (originally ColumnIO, later succeeded by Capacitor), which makes reading and analyzing large datasets much faster and more efficient. This allows BigQuery to scan petabytes of data without wasting compute on columns a query never touches.

 

In simple terms, Colossus makes BigQuery fast, reliable, and scalable—helping businesses to store and analyze massive amounts of data without worrying about managing hardware or handling failures.

 

Imagine we have a Customer Transactions Table with the following data:

 

Customer_Id | Name    | Age | Purchase_Amount | Country
101         | Ram     | 30  | 200             | USA
102         | Krishna | 35  | 190             | India
103         | Chamu   | 43  | 180             | Canada
104         | Sailu   | 23  | 450             | UK

 

Row-Based Storage (Traditional Databases like MySQL, PostgreSQL)

Data is stored row by row, like this:

101,Ram,30,200,USA
102,Krishna,35,190,India
103,Chamu,43,180,Canada
104,Sailu,23,450,UK

If we want to find the average Purchase_Amount, the database still has to read every complete row, even though we only need the Purchase_Amount column.

 

Column-Based Storage (BigQuery)

In column-based storage, data is stored column by column:

Customer_ID: 101, 102, 103, 104  
Name: Ram, Krishna, Chamu, Sailu  
Age: 30, 35, 43, 23  
Purchase_Amount: 200, 190, 180, 450  
Country: USA, India, Canada, UK  

If we run a query like:

SELECT AVG(Purchase_Amount) FROM Transactions;

BigQuery only scans the "Purchase_Amount" column, ignoring other columns, making the query much faster and more efficient.

 

Columnar storage is best for analytics and large-scale queries where only a few columns are needed at a time. It significantly improves query speed, reduces costs, and enables efficient data compression, making it ideal for BigQuery and other modern data warehouses.
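The difference can be sketched with a toy example in Python (an illustrative sketch only; BigQuery's actual columnar format is far more sophisticated). The sketch counts how many stored values each layout must touch to answer the AVG(Purchase_Amount) query:

```python
# Row-based layout: each record is stored together.
rows = [
    (101, "Ram", 30, 200, "USA"),
    (102, "Krishna", 35, 190, "India"),
    (103, "Chamu", 43, 180, "Canada"),
    (104, "Sailu", 23, 450, "UK"),
]

# Column-based layout: each column is stored together.
columns = {
    "Customer_Id": [101, 102, 103, 104],
    "Name": ["Ram", "Krishna", "Chamu", "Sailu"],
    "Age": [30, 35, 43, 23],
    "Purchase_Amount": [200, 190, 180, 450],
    "Country": ["USA", "India", "Canada", "UK"],
}

def avg_purchase_row_store(rows):
    # Must touch every field of every row just to reach one value per row.
    values_touched = sum(len(row) for row in rows)
    avg = sum(row[3] for row in rows) / len(rows)
    return avg, values_touched

def avg_purchase_column_store(columns):
    # Reads only the single column the query needs.
    col = columns["Purchase_Amount"]
    return sum(col) / len(col), len(col)

print(avg_purchase_row_store(rows))        # (255.0, 20): 20 values touched
print(avg_purchase_column_store(columns))  # (255.0, 4): only 4 values touched
```

Both layouts return the same average (255.0), but the row store touches 20 stored values while the column store touches 4. On a table with billions of rows and dozens of columns, that gap is what makes columnar scans so much cheaper.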

 

2. Jupiter: The High-Speed Network That Powers BigQuery

When working with Big Data, one of the biggest challenges is moving data quickly between machines. Even if a system has powerful processors and huge storage, slow network speeds can create bottlenecks, making queries take longer.

 

To solve this, Google built Jupiter, a high-speed networking system that allows BigQuery to move massive amounts of data extremely fast.

 

Jupiter is Google’s advanced network that connects thousands of computers inside Google’s data centers. It acts like an ultra-fast highway where data can travel at incredibly high speeds without congestion.

 

As per the official documentation, the Jupiter network delivers more than 1 Petabit per second (1 Pbps) of total bisection bandwidth.

 

1 Petabit per second = 1,000 Terabits per second, or 125 Terabytes per second.
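As a quick sanity check of that conversion (network bandwidth is quoted in bits, storage in bytes, hence the divide-by-8):

```python
# Unit arithmetic behind the 1 Pbps figure.
TBITS_PER_PBIT = 1_000   # 1 petabit = 1,000 terabits
BITS_PER_BYTE = 8

terabits_per_second = 1 * TBITS_PER_PBIT                     # 1,000 Tbps
terabytes_per_second = terabits_per_second / BITS_PER_BYTE   # 125 TB/s

# At that rate, moving a 1-petabyte dataset (1,000 TB) would take:
seconds_for_one_petabyte = 1_000 / terabytes_per_second      # 8 seconds

print(terabytes_per_second, seconds_for_one_petabyte)  # 125.0 8.0
```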

 

This speed allows Google to handle huge workloads efficiently and distribute them across different machines. In short, Jupiter is the secret behind BigQuery’s speed, allowing it to handle petabytes of data effortlessly and making it one of the fastest cloud data warehouses available today.

 

3. Dremel: The Powerful Query Engine Behind BigQuery

Running a SQL query on huge datasets (billions of rows) is not just about having powerful hardware; it also requires an efficient query execution engine. This is where Dremel, BigQuery's execution engine, comes in.

 

Dremel is the brain of BigQuery: it breaks your query into smaller tasks, distributes them across many machines, processes them in parallel, and then combines the results quickly.

 

Dremel organizes queries using a tree structure with three main components:

 

Component: Slots (leaves of the tree)
Function: read the data and perform computations such as filtering and aggregation.
Analogy: workers in a factory doing small tasks in parallel.

Component: Shuffle
Function: moves intermediate data quickly between different stages of the query.
Analogy: a mail-processing facility, where letters and packages arrive from different cities, the sorting system shuffles them by destination, and the sorted mail is sent to the correct delivery trucks for final distribution.

Component: Mixers (branches of the tree)
Function: aggregate (combine) the results from the slots.
Analogy: a manager who collects reports from different teams and combines them into a summary.

 

The mixers and slots are managed by Borg, which distributes the available hardware resources as needed.
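The tree execution above can be sketched in a few lines of Python. This is a minimal illustration under simplified assumptions (real Dremel is vastly more complex, and the shard split here is invented for the example): "slots" each scan one shard of the Purchase_Amount column in parallel and emit partial (sum, count) pairs, and a "mixer" combines the partials into the final average.

```python
from concurrent.futures import ThreadPoolExecutor

purchase_amount = [200, 190, 180, 450]            # the column from the example table
shards = [purchase_amount[:2], purchase_amount[2:]]  # split across two "slots"

def slot(shard):
    # Leaf worker: local computation on its own shard only.
    return sum(shard), len(shard)

def mixer(partials):
    # Branch node: aggregates the partial results from the slots.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partials = list(pool.map(slot, shards))  # [(390, 2), (630, 2)]

print(mixer(partials))  # 255.0, same as AVG(Purchase_Amount)
```

Note that the slots never see each other's data; only small partial results travel up the tree, which is what lets the same plan scale from four rows to billions.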

 

4. Borg: The System That Manages Google’s Huge Clusters

Imagine Google has a big data center with thousands of machines working together to process data. Managing all these machines manually would be impossible! That's where Borg comes in: a powerful cluster-management system that makes sure everything runs smoothly.


Borg controls and coordinates thousands of machines working together on different tasks. It runs many applications at the same time without slowdowns, and it ensures that every machine is fully utilized, with none sitting idle.

 

Machines can crash, power can go out, or networks can fail, but Borg automatically adjusts to keep things running. If a server fails while your query is running, Borg instantly moves the work to another server; you won't even notice.

 

When you run a BigQuery query, Borg finds thousands of CPU cores to process your data instantly. Even if your query uses 100 CPUs, that’s just a small fraction of Borg’s power. Since Borg manages a massive network of machines, BigQuery can run smoothly without delays.

 

In summary, BigQuery is a fully managed, serverless analytics data warehouse that enables users to run complex SQL queries on massive datasets with exceptional speed and ease. It abstracts away the complexities of infrastructure management, allowing users to focus on data analysis. By leveraging Google's powerful internal infrastructure, including Colossus, Jupiter, Dremel, and Borg, BigQuery delivers consistent performance and scalability, regardless of data volume. Its core strength lies in its ability to provide a seamless and transparent analytics experience, making it accessible to a wide range of users.

 

References

Dremel: Interactive Analysis of Web-Scale Datasets
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

Large-scale cluster management at Google with Borg
https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/

Storage Architecture and Challenges (Colossus)
https://cloud.google.com/files/storage_architecture_and_challenges.pdf

A Look Inside Google's Data Center Networks (Jupiter)
https://cloudplatform.googleblog.com/2015/06/A-Look-Inside-Googles-Data-Center-Networks.html

BigQuery under the hood
https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood


 

 

 
