BigQuery is a powerful data warehouse solution on Google Cloud Platform (GCP), known for its speed, scalability, and simplicity. While many users load data into BigQuery tables before running queries, there's a lesser-known but incredibly useful feature — external data sources.
External data sources let you query data directly from external storage systems like Google Cloud Storage, Bigtable, Google Drive, and Cloud SQL without loading it into BigQuery. This can save time, reduce storage costs, and let you work more flexibly with large or frequently changing datasets.
In this post, we’ll explore:
· What external data sources are
· The two ways to work with external data in BigQuery
· Supported external sources
· How to query data from an external source
· Real examples and best practices
1. What Are External Data Sources?
An external data source is any data storage system outside of BigQuery that you can query directly without loading data into BigQuery tables.
Think of it as “borrowing” the data for querying rather than “owning” it in BigQuery storage.
The following table summarizes the external sources supported in BigQuery:

| Source | Description |
| --- | --- |
| Google Cloud Storage | Query CSV, JSON, Parquet, Avro, and ORC files stored in GCS buckets. |
| Google Drive | Query spreadsheets (e.g., .xlsx, .csv) directly from Google Drive. |
| Cloud Bigtable | Query high-throughput NoSQL data from Bigtable using federated queries. |
| Cloud SQL | Query relational data from MySQL or PostgreSQL instances hosted on Cloud SQL. |
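As an illustration of the Cloud SQL row above, BigQuery federated queries use the EXTERNAL_QUERY function, which sends a query to the external database through a connection resource. The connection ID, project, and table names below are hypothetical placeholders:

```sql
-- Run a query against a Cloud SQL (MySQL/PostgreSQL) instance
-- through a BigQuery connection resource.
-- 'my_project.us.my_cloudsql_connection' is a hypothetical connection ID.
SELECT *
FROM EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',
  'SELECT id, name FROM users LIMIT 10'
);
```

The inner query runs on the Cloud SQL instance in its own SQL dialect; only the result set is returned to BigQuery.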
2. How to query data from an external source?
Step 1: Upload some data to GCP Bucket.
Login to GCP cloud console (https://console.cloud.google.com/).
Search for storage.
Click on Buckets.
You will be taken to Buckets listing page.
Click on Create button to create new bucket.
You will be taken to ‘Create a bucket’ form.
Give the bucket a name; for example, I have given ‘demo_bucket_external_storage’. Click on the Create button to create the bucket.
Upon successful creation of the bucket, you will be taken to bucket detail page.
Upload auditLog.csv to the bucket.
auditLog.csv
event_id,event_type,user_id,timestamp,resource,action,status
1,LOGIN,user_101,2025-04-05T08:00:00Z,/login,AUTH_SUCCESS,SUCCESS
2,UPDATE,user_102,2025-04-05T08:01:00Z,/profile,EDIT,SUCCESS
3,DELETE,user_103,2025-04-05T08:02:00Z,/account,REMOVE,FAILED
4,VIEW,user_104,2025-04-05T08:03:00Z,/dashboard,READ,SUCCESS
5,LOGOUT,user_105,2025-04-05T08:04:00Z,/logout,AUTH_LOGOUT,SUCCESS
6,LOGIN,user_106,2025-04-05T08:05:00Z,/login,AUTH_SUCCESS,SUCCESS
7,UPDATE,user_107,2025-04-05T08:06:00Z,/profile,EDIT,FAILED
8,VIEW,user_108,2025-04-05T08:07:00Z,/dashboard,READ,SUCCESS
9,DELETE,user_109,2025-04-05T08:08:00Z,/account,REMOVE,SUCCESS
10,LOGOUT,user_110,2025-04-05T08:09:00Z,/logout,AUTH_LOGOUT,SUCCESS
Click on Upload -> Upload files
Upload auditLog.csv file to the bucket.
Step 2: Let’s create a table using auditLog.csv file.
Navigate to BigQuery dataset.
Click on the ‘Create Table’ button. It opens the ‘Create table’ form.
Select Source as ‘Google Cloud Storage’.
It enables an option to browse the file from GCS bucket.
Select the file from the GCS bucket. You can even specify multiple files using a wildcard pattern.
Under Destination section, give the table name as audit_log.
You have two options to choose the table type.
a. Native table
b. External Table
Native Table vs External Table

| Feature | Native Tables | External Tables |
| --- | --- | --- |
| Storage | Stored inside BigQuery-managed storage. | Stored outside BigQuery, in services like Cloud Storage, Bigtable, etc. |
| Performance | Optimized for performance (partitioning, clustering, caching). | May be slower; depends on the external source. |
| Cost | Charged for data storage and query processing. | Only query processing is charged (no storage fees from BigQuery). |
| Schema Management | Schema is enforced within BigQuery. | Schema can be inferred or defined manually. |
| Security | Integrates with BigQuery IAM; supports fine-grained security. | Limited, or depends on the external source. |
| Use Case | Fully managed, high-performance analytics. | Query external sources without loading data into BigQuery. |
Native Table FAQs
· Data is Copied: When you load data from a file in a GCS bucket into a native BigQuery table, the data is physically copied and stored within BigQuery's managed storage.
· No Automatic Refresh: Subsequent updates or changes to the original file in the GCS bucket will NOT automatically be reflected in the native BigQuery table.
· Vice Versa: Similarly, if you modify or delete data within the native BigQuery table, this will not affect the original file in the GCS bucket.
External Tables FAQs:
· Data Remains in GCS: When you create an external table in BigQuery that points to a file (or a set of files using wildcards) in a GCS bucket, the data itself remains in the GCS bucket. BigQuery only stores metadata about the data's location and schema.
· Automatic Refresh (on Query): When you query an external table, BigQuery reads the data directly from the files in the GCS bucket at that moment. Therefore:
o If the file in the GCS bucket is updated, the next time you query the external table, you will see the updated data. The "refresh" happens automatically with each query.
o If new files are added to the GCS bucket that match the URI pattern defined for the external table, those new files will also be included in subsequent queries.
· Vice Versa (Not Applicable): You cannot directly modify the underlying data in the GCS bucket by performing DML (Data Manipulation Language - INSERT, UPDATE, DELETE) operations on the external BigQuery table because external tables are generally read-only. To change the data, you must modify the source files in the GCS bucket.
Let’s choose the option ‘Native Table’ for the demo purpose.
Select the ‘Auto detect’ option under the Schema section.
Click on ‘Create’ button to create the table.
Upon successful creation of audit_log table, you will be navigated to audit_log schema page.
Click on Preview tab, to see data preview.
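Besides the Preview tab, you can sanity-check the loaded data with a quick query. The project and dataset names below are placeholders for your own:

```sql
-- Count audit events by status in the native audit_log table.
SELECT status, COUNT(*) AS event_count
FROM `my_project.my_dataset.audit_log`
GROUP BY status
ORDER BY event_count DESC;
```

With the sample auditLog.csv above, this should show 8 SUCCESS rows and 2 FAILED rows.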
Create External table
Let’s follow the same procedure as for the native table creation, except:
a. Choose the table type as ‘External table’.
b. Give the table name as audit_log_external.
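The same external table can also be created with SQL DDL instead of the console. A sketch, assuming the bucket and dataset names used in this demo:

```sql
-- Create an external table over the CSV file in GCS.
-- The data stays in the bucket; BigQuery stores only the
-- table's metadata (location and schema).
CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.audit_log_external`
OPTIONS (
  format = 'CSV',
  -- A wildcard like 'gs://demo_bucket_external_storage/auditLog*.csv'
  -- would match multiple files.
  uris = ['gs://demo_bucket_external_storage/auditLog.csv'],
  skip_leading_rows = 1  -- skip the CSV header row
);
```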
Upon successful creation of audit_log_external table, you can observe that the preview option is not available for external tables.
Click on Query button to query the data.
SELECT * FROM `my_project.my_dataset.audit_log_external`
_FILE_NAME pseudo column in External table
_FILE_NAME is a pseudo column available when querying external tables (like CSV, JSON, Parquet, etc. stored in GCS). It tells you which specific file a given row came from.
Why is _FILE_NAME useful?
· Debugging: Identify which file contains bad/missing data.
· Auditing: Trace data lineage back to its source file.
· Filtering: Run queries only on specific files.
· Partition Simulation: Mimic partitioning behavior if file naming follows a date/user pattern.
SELECT *, _FILE_NAME FROM `my_project.my_dataset.audit_log_external`
The above snippet includes the source file path for each row in the result.
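For example, the filtering use case looks like this (project and dataset names are placeholders):

```sql
-- Restrict a query to rows that came from a specific source file.
SELECT event_id, event_type, status, _FILE_NAME AS source_file
FROM `my_project.my_dataset.audit_log_external`
WHERE _FILE_NAME LIKE '%auditLog.csv';
```

The same pattern works for the partition-simulation use case if your file names embed a date, e.g. `WHERE _FILE_NAME LIKE '%2025-04-05%'`.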
3. Limitations When Working with External Data Sources in BigQuery
3.1 No Strong Consistency Guarantee: When querying data from external sources (like Google Cloud Storage), BigQuery reads directly from the original files. If those files are modified, added, or deleted during the query execution, the results may be incomplete, incorrect, or inconsistent.
For example: If you start a query while someone is uploading a new file or modifying an existing one, your results might not include the latest data — or might fail altogether.
For reliable results, avoid changing the source files while queries are running.
3.2 Slower Query Performance Compared to Native Tables: Since the data isn’t stored in BigQuery itself, it has to fetch and read the data from the external location each time a query runs. This adds overhead and can make queries slower, especially with large files or many files.
Unlike native tables that are optimized and indexed internally, external data lacks those performance boosts. Use external tables for exploratory or light queries. For repeated analysis, consider loading the data into a native table.
3.3 Export Jobs Aren’t Allowed
BigQuery lets you export data from native tables to places like Google Cloud Storage using EXPORT DATA. However, you can’t export data from an external table using a BigQuery export job.
Save the query result of the external table into a native table first, then export that table.
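A sketch of that workaround, with placeholder names:

```sql
-- 1. Materialize the external data into a native table.
CREATE OR REPLACE TABLE `my_project.my_dataset.audit_log_native_copy` AS
SELECT * FROM `my_project.my_dataset.audit_log_external`;

-- 2. Export the native table to GCS (the '*' in the URI lets
--    BigQuery shard the output across multiple files).
EXPORT DATA OPTIONS (
  uri = 'gs://demo_bucket_external_storage/exports/audit_log_*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT * FROM `my_project.my_dataset.audit_log_native_copy`;
```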
3.4 Limited Partitioning and Clustering: Partitioning and clustering are powerful BigQuery features that help organize and optimize data for performance. But external tables have very limited support for them.
You can’t partition an external table based on column values.
Clustering is not supported for most file-based external tables (e.g., CSV, Parquet).
If you need partitioning or clustering, load the external data into BigQuery.
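One partial exception: for GCS-based external tables, BigQuery can derive partition columns from Hive-style `key=value` folder names in the file paths. A sketch, assuming a hypothetical `dt=YYYY-MM-DD` folder layout in the demo bucket:

```sql
-- External table whose 'dt' partition column is inferred from the path,
-- e.g. gs://demo_bucket_external_storage/logs/dt=2025-04-05/auditLog.csv
CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.audit_log_partitioned`
WITH PARTITION COLUMNS
OPTIONS (
  format = 'CSV',
  uris = ['gs://demo_bucket_external_storage/logs/*'],
  hive_partition_uri_prefix = 'gs://demo_bucket_external_storage/logs',
  skip_leading_rows = 1
);
```

Filtering on `dt` then lets BigQuery skip files in non-matching folders, but this still falls short of the partition pruning and clustering available on native tables.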
3.5. No Query Caching: Normally, BigQuery caches the results of queries on native tables, which helps speed up repeated queries and saves costs. But when querying external data sources, caching is disabled. That means, each time you run a query on external data, it will read the files again and charge you for the full amount of data processed.
If you plan to run the same query multiple times, consider loading the data into BigQuery to take advantage of caching.
In summary, external tables are great for quick access to data stored outside BigQuery, especially when data changes often or is too big to load. But for performance, consistency, and advanced features, loading data into native BigQuery tables is usually a better long-term option.