BigQuery is a powerful data warehouse solution on Google Cloud Platform (GCP), known for its speed, scalability, and simplicity. While many users load data into BigQuery tables before running queries, there's a lesser-known but incredibly useful feature — external data sources.
External data sources let you query data directly from external storage systems like Google Cloud Storage, Bigtable, Google Drive, and Cloud SQL without loading it into BigQuery. This can save time, reduce storage costs, and let you work more flexibly with large or frequently changing datasets.
In this post, we’ll explore:
· What external data sources are
· The two ways to work with external data in BigQuery
· Supported external sources
· How to query data from an external source
· Real examples and best practices
1. What Are External Data Sources?
An external data source is any data storage system outside of BigQuery that you can query directly without loading data into BigQuery tables.
Think of it as “borrowing” the data for querying rather than “owning” it in BigQuery storage.
The following table summarizes the external sources supported in BigQuery:

| Source | Description |
| --- | --- |
| Google Cloud Storage | Query CSV, JSON, Parquet, Avro, and ORC files stored in GCS buckets. |
| Google Drive | Query spreadsheets (e.g., .xlsx, .csv) directly from Google Drive. |
| Cloud Bigtable | Query high-throughput NoSQL data from Bigtable using federated queries. |
| Cloud SQL | Query relational data from MySQL or PostgreSQL instances hosted on Cloud SQL. |
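As an illustration of the Cloud SQL row above, BigQuery federated queries use the EXTERNAL_QUERY function, which sends a query to the external database through a connection resource. The connection ID, project, and table names below are hypothetical placeholders:

```sql
-- Run a query against a Cloud SQL (MySQL/PostgreSQL) instance
-- through a BigQuery connection resource.
-- 'my_project.us.my_cloudsql_connection' is a hypothetical connection ID.
SELECT *
FROM EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',
  'SELECT id, name FROM users LIMIT 10'
);
```

The inner query runs on the Cloud SQL instance in its own SQL dialect; only the result set is returned to BigQuery.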
2. How to query data from an external source?
Step 1: Upload some data to GCP Bucket.
Login to GCP cloud console (https://console.cloud.google.com/).
Search for storage.
Click on Buckets.
You will be taken to Buckets listing page.
Click on Create button to create new bucket.
You will be taken to ‘Create a bucket’ form.
Give the bucket a name; for example, I have given ‘demo_bucket_external_storage’. Click on the Create button to create the bucket.
Upon successful creation of the bucket, you will be taken to bucket detail page.
Upload auditLog.csv to the bucket.
auditLog.csv
event_id,event_type,user_id,timestamp,resource,action,status
1,LOGIN,user_101,2025-04-05T08:00:00Z,/login,AUTH_SUCCESS,SUCCESS
2,UPDATE,user_102,2025-04-05T08:01:00Z,/profile,EDIT,SUCCESS
3,DELETE,user_103,2025-04-05T08:02:00Z,/account,REMOVE,FAILED
4,VIEW,user_104,2025-04-05T08:03:00Z,/dashboard,READ,SUCCESS
5,LOGOUT,user_105,2025-04-05T08:04:00Z,/logout,AUTH_LOGOUT,SUCCESS
6,LOGIN,user_106,2025-04-05T08:05:00Z,/login,AUTH_SUCCESS,SUCCESS
7,UPDATE,user_107,2025-04-05T08:06:00Z,/profile,EDIT,FAILED
8,VIEW,user_108,2025-04-05T08:07:00Z,/dashboard,READ,SUCCESS
9,DELETE,user_109,2025-04-05T08:08:00Z,/account,REMOVE,SUCCESS
10,LOGOUT,user_110,2025-04-05T08:09:00Z,/logout,AUTH_LOGOUT,SUCCESS
Click on Upload -> Upload files
Upload auditLog.csv file to the bucket.
Step 2: Let’s create a table using auditLog.csv file.
Navigate to BigQuery dataset.
Click on the ‘Create Table’ button. It opens the ‘Create table’ form.
Select Source as ‘Google Cloud Storage’.
It enables an option to browse the file from GCS bucket.
Select the file from the GCS bucket. You can even specify multiple files using a wildcard pattern.
Under Destination section, give the table name as audit_log.
You have two options to choose the table type.
a. Native table
b. External Table
Native Table vs External Table

| Feature | Native Tables | External Tables |
| --- | --- | --- |
| Storage | Stored inside BigQuery-managed storage. | Stored outside BigQuery, in services like Cloud Storage, Bigtable, etc. |
| Performance | Optimized for performance (partitioning, clustering, caching). | May be slower; depends on the external source. |
| Cost | Charged for data storage and query processing. | Only query processing is charged (no storage fees from BigQuery). |
| Schema Management | Schema is enforced within BigQuery. | Schema can be inferred or defined manually. |
| Security | Integrates with BigQuery IAM; supports fine-grained security. | Limited, or depends on the external source. |
| Use Case | Fully managed, high-performance analytics. | Query external sources without loading data into BigQuery. |
Native Table FAQs
· Data is Copied: When you load data from a file in a GCS bucket into a native BigQuery table, the data is physically copied and stored within BigQuery's managed storage.
· No Automatic Refresh: Subsequent updates or changes to the original file in the GCS bucket will NOT automatically be reflected in the native BigQuery table.
· Vice Versa: Similarly, if you modify or delete data within the native BigQuery table, this will not affect the original file in the GCS bucket.
External Tables FAQs:
· Data Remains in GCS: When you create an external table in BigQuery that points to a file (or a set of files using wildcards) in a GCS bucket, the data itself remains in the GCS bucket. BigQuery only stores metadata about the data's location and schema.
· Automatic Refresh (on Query): When you query an external table, BigQuery reads the data directly from the files in the GCS bucket at that moment. Therefore:
o If the file in the GCS bucket is updated, the next time you query the external table, you will see the updated data. The "refresh" happens automatically with each query.
o If new files are added to the GCS bucket that match the URI pattern defined for the external table, those new files will also be included in subsequent queries.
· Vice Versa (Not Applicable): You cannot directly modify the underlying data in the GCS bucket by performing DML (Data Manipulation Language - INSERT, UPDATE, DELETE) operations on the external BigQuery table because external tables are generally read-only. To change the data, you must modify the source files in the GCS bucket.
Let’s choose the option ‘Native Table’ for the demo purpose.
Select the ‘Auto detect’ option under the Schema section.
Click on ‘Create’ button to create the table.
Upon successful creation of audit_log table, you will be navigated to audit_log schema page.
Click on Preview tab, to see data preview.
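Besides the Preview tab, you can sanity-check the loaded data with a quick query. The project and dataset names below are placeholders for your own:

```sql
-- Count audit events by status in the native audit_log table.
SELECT status, COUNT(*) AS event_count
FROM `my_project.my_dataset.audit_log`
GROUP BY status
ORDER BY event_count DESC;
```

With the sample auditLog.csv above, this should show 8 SUCCESS rows and 2 FAILED rows.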
Create External table
Let’s follow the same procedure as for the native table creation, except:
a. Choose the table type as ‘External table’.
b. Give the table name as audit_log_external.
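The same external table can also be created with SQL DDL instead of the console. A sketch, assuming the bucket and dataset names used in this demo:

```sql
-- Create an external table over the CSV file in GCS.
-- The data stays in the bucket; BigQuery stores only the
-- table's metadata (location and schema).
CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.audit_log_external`
OPTIONS (
  format = 'CSV',
  -- A wildcard like 'gs://demo_bucket_external_storage/auditLog*.csv'
  -- would match multiple files.
  uris = ['gs://demo_bucket_external_storage/auditLog.csv'],
  skip_leading_rows = 1  -- skip the CSV header row
);
```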
Upon successful creation of audit_log_external table, you can observe that the preview option is not available for external tables.
Click on Query button to query the data.
SELECT * FROM `my_project.my_dataset.audit_log_external`
_FILE_NAME pseudo column in External table
_FILE_NAME is a pseudo column available when querying external tables (like CSV, JSON, Parquet, etc. stored in GCS). It tells you which specific file a given row came from.
Why is _FILE_NAME useful?
· Debugging: Identify which file contains bad/missing data.
· Auditing: Trace data lineage back to its source file.
· Filtering: Run queries only on specific files.
· Partition Simulation: Mimic partitioning behavior if file naming follows a date/user pattern.
SELECT *, _FILE_NAME FROM `my_project.my_dataset.audit_log_external`
The above snippet includes the source file path for each row in the result.
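For example, the filtering use case looks like this (project and dataset names are placeholders):

```sql
-- Restrict a query to rows that came from a specific source file.
SELECT event_id, event_type, status, _FILE_NAME AS source_file
FROM `my_project.my_dataset.audit_log_external`
WHERE _FILE_NAME LIKE '%auditLog.csv';
```

The same pattern works for the partition-simulation use case if your file names embed a date, e.g. `WHERE _FILE_NAME LIKE '%2025-04-05%'`.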
3. Limitations When Working with External Data Sources in BigQuery
3.1 No Strong Consistency Guarantee: When querying data from external sources (like Google Cloud Storage), BigQuery reads directly from the original files. If those files are modified, added, or deleted during the query execution, the results may be incomplete, incorrect, or inconsistent.
For example: If you start a query while someone is uploading a new file or modifying an existing one, your results might not include the latest data — or might fail altogether.
For reliable results, avoid changing the source files while queries are running.
3.2 Slower Query Performance Compared to Native Tables: Since the data isn’t stored in BigQuery itself, it has to fetch and read the data from the external location each time a query runs. This adds overhead and can make queries slower, especially with large files or many files.
Unlike native tables that are optimized and indexed internally, external data lacks those performance boosts. Use external tables for exploratory or light queries. For repeated analysis, consider loading the data into a native table.
3.3 Export Jobs Aren’t Allowed
BigQuery lets you export data from native tables to places like Google Cloud Storage using EXPORT DATA. However, you can’t export data from an external table using a BigQuery export job.
Save the query result of the external table into a native table first, then export that table.
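A sketch of that workaround, with placeholder names:

```sql
-- 1. Materialize the external data into a native table.
CREATE OR REPLACE TABLE `my_project.my_dataset.audit_log_native_copy` AS
SELECT * FROM `my_project.my_dataset.audit_log_external`;

-- 2. Export the native table to GCS (the '*' in the URI lets
--    BigQuery shard the output across multiple files).
EXPORT DATA OPTIONS (
  uri = 'gs://demo_bucket_external_storage/exports/audit_log_*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT * FROM `my_project.my_dataset.audit_log_native_copy`;
```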
3.4 Limited Partitioning and Clustering: Partitioning and clustering are powerful BigQuery features that help organize and optimize data for performance. But external tables have very limited support for them.
You can’t partition an external table based on column values.
Clustering is not supported for most file-based external tables (e.g., CSV, Parquet).
If you need partitioning or clustering, load the external data into BigQuery.
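One partial exception: for GCS-based external tables, BigQuery can derive partition columns from Hive-style `key=value` folder names in the file paths. A sketch, assuming a hypothetical `dt=YYYY-MM-DD` folder layout in the demo bucket:

```sql
-- External table whose 'dt' partition column is inferred from the path,
-- e.g. gs://demo_bucket_external_storage/logs/dt=2025-04-05/auditLog.csv
CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.audit_log_partitioned`
WITH PARTITION COLUMNS
OPTIONS (
  format = 'CSV',
  uris = ['gs://demo_bucket_external_storage/logs/*'],
  hive_partition_uri_prefix = 'gs://demo_bucket_external_storage/logs',
  skip_leading_rows = 1
);
```

Filtering on `dt` then lets BigQuery skip files in non-matching folders, but this still falls short of the partition pruning and clustering available on native tables.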
3.5. No Query Caching: Normally, BigQuery caches the results of queries on native tables, which helps speed up repeated queries and saves costs. But when querying external data sources, caching is disabled. That means, each time you run a query on external data, it will read the files again and charge you for the full amount of data processed.
If you plan to run the same query multiple times, consider loading the data into BigQuery to take advantage of caching.
In summary, external tables are great for quick access to data stored outside BigQuery, especially when data changes often or is too big to load. But for performance, consistency, and advanced features, loading data into native BigQuery tables is usually a better long-term option.