In BigQuery, handling datasets with multiple partitioned or sharded tables can be cumbersome if each table needs to be queried individually. Wildcard tables allow querying multiple tables at once using a concise SQL statement. This feature is particularly useful when dealing with time-series data or logs that are stored in different tables following a consistent naming pattern.
1. What are Wildcard Tables?
Wildcard tables in BigQuery let you query multiple tables that share a common schema and a similar naming convention. Instead of writing multiple UNION ALL statements, you can use a wildcard expression to efficiently fetch data from all matching tables.
Schema Requirements
· All matching tables should have the same schema or a compatible schema.
· If schemas differ, queries might fail or require explicit column selection to avoid mismatches.
Example Use Case
Suppose you have daily log tables in the dataset my_project.logs_dataset, named as follows:
· app_logs_20240301
· app_logs_20240302
· app_logs_20240303
Instead of querying each table individually, you can use a wildcard:
SELECT event_time, user_id, event_type FROM `my_project.logs_dataset.app_logs_*`
Benefits of Wildcard Tables
· Simplifies Queries: No need to manually list and UNION ALL multiple tables.
· Efficient Execution: BigQuery processes only the required tables based on _TABLE_SUFFIX.
· Scalability: Ideal for querying large datasets partitioned by time, improving query manageability.
Sample DDL and DML statements for you to experiment
-- Creating table for March 1st, 2024 CREATE TABLE `my_project.logs_dataset.app_logs_20240301` ( event_time TIMESTAMP, user_id STRING, event_type STRING ); -- Creating table for March 2nd, 2024 CREATE TABLE `my_project.logs_dataset.app_logs_20240302` ( event_time TIMESTAMP, user_id STRING, event_type STRING ); -- Creating table for March 3rd, 2024 CREATE TABLE `my_project.logs_dataset.app_logs_20240303` ( event_time TIMESTAMP, user_id STRING, event_type STRING ); -- Inserting data into app_logs_20240301 INSERT INTO `my_project.logs_dataset.app_logs_20240301` (event_time, user_id, event_type) VALUES (TIMESTAMP '2024-03-01 10:00:00', 'user_001', 'login'), (TIMESTAMP '2024-03-01 10:05:00', 'user_002', 'purchase'), (TIMESTAMP '2024-03-01 10:10:00', 'user_003', 'logout'); -- Inserting data into app_logs_20240302 INSERT INTO `my_project.logs_dataset.app_logs_20240302` (event_time, user_id, event_type) VALUES (TIMESTAMP '2024-03-02 09:00:00', 'user_004', 'login'), (TIMESTAMP '2024-03-02 09:15:00', 'user_001', 'purchase'), (TIMESTAMP '2024-03-02 09:30:00', 'user_002', 'logout'); -- Inserting data into app_logs_20240303 INSERT INTO `my_project.logs_dataset.app_logs_20240303` (event_time, user_id, event_type) VALUES (TIMESTAMP '2024-03-03 08:30:00', 'user_005', 'login'), (TIMESTAMP '2024-03-03 08:45:00', 'user_002', 'purchase'), (TIMESTAMP '2024-03-03 09:00:00', 'user_004', 'logout');
2. Understanding _TABLE_SUFFIX in wildcard table names
_TABLE_SUFFIX is a special pseudo-column in BigQuery that represents the variable part of a wildcard table name. When you use a wildcard (*) in your query to match multiple tables, _TABLE_SUFFIX helps you filter and differentiate between them dynamically.
Querying Multiple Tables Using _TABLE_SUFFIX
Let's say you have the following tables in your dataset.
Table Name |
Data Stored |
app_logs_20240301 |
Logs for March 1, 2024 |
app_logs_20240302 |
Logs for March 2, 2024 |
app_logs_20240303 |
Logs for March 3, 2024 |
Basic Wildcard Query Without _TABLE_SUFFIX
If you run the following query:
SELECT event_time, user_id, event_type FROM `my_project.logs_dataset.app_logs_*`
It will fetch all data from
all tables that match app_logs_*. But, you won’t know which table each row came
from.
+----------------------------+--------+-----------+ | event_time | user_id | event_type | +----------------------------+--------+-----------+ | 2024-03-03 09:00:00.000000 | user_004 | logout | | 2024-03-03 08:45:00.000000 | user_002 | purchase | | 2024-03-03 08:30:00.000000 | user_005 | login | | 2024-03-01 10:10:00.000000 | user_003 | logout | | 2024-03-01 10:00:00.000000 | user_001 | login | | 2024-03-01 10:05:00.000000 | user_002 | purchase | | 2024-03-02 09:30:00.000000 | user_002 | logout | | 2024-03-02 09:00:00.000000 | user_004 | login | | 2024-03-02 09:15:00.000000 | user_001 | purchase | +----------------------------+--------+-----------+
Using _TABLE_SUFFIX to Identify the Source Table
To see which table each row belongs to, include _TABLE_SUFFIX in the SELECT statement:
SELECT event_time, user_id, event_type, _TABLE_SUFFIX FROM `my_project.logs_dataset.app_logs_*`
Sample Output
_TABLE_SUFFIX extracts the date part (20240301, 20240302, 20240303) from each table name.
+----------------------------+--------+-----------+-----------+ | event_time | user_id | event_type | _TABLE_SUFFIX | +----------------------------+--------+-----------+-----------+ | 2024-03-01 10:10:00.000000 | user_003 | logout | 20240301 | | 2024-03-01 10:00:00.000000 | user_001 | login | 20240301 | | 2024-03-01 10:05:00.000000 | user_002 | purchase | 20240301 | | 2024-03-02 09:30:00.000000 | user_002 | logout | 20240302 | | 2024-03-02 09:00:00.000000 | user_004 | login | 20240302 | | 2024-03-02 09:15:00.000000 | user_001 | purchase | 20240302 | | 2024-03-03 09:00:00.000000 | user_004 | logout | 20240303 | | 2024-03-03 08:45:00.000000 | user_002 | purchase | 20240303 | | 2024-03-03 08:30:00.000000 | user_005 | login | 20240303 | +----------------------------+--------+-----------+-----------+
Filtering Data for a Specific Date Range Using _TABLE_SUFFIX
Let's say you only want data from March 1st to March 2nd, 2024. Instead of specifying each table separately, use _TABLE_SUFFIX to filter:
SELECT event_time, user_id, event_type, _TABLE_SUFFIX FROM `my_project.logs_dataset.app_logs_*` WHERE _TABLE_SUFFIX IN UNNEST(['20240301', '20240302']);
+----------------------------+--------+-----------+-----------+ | event_time | user_id | event_type | _TABLE_SUFFIX | +----------------------------+--------+-----------+-----------+ | 2024-03-01 10:10:00.000000 | user_003 | logout | 20240301 | | 2024-03-01 10:00:00.000000 | user_001 | login | 20240301 | | 2024-03-01 10:05:00.000000 | user_002 | purchase | 20240301 | | 2024-03-02 09:30:00.000000 | user_002 | logout | 20240302 | | 2024-03-02 09:00:00.000000 | user_004 | login | 20240302 | | 2024-03-02 09:15:00.000000 | user_001 | purchase | 20240302 | +----------------------------+--------+-----------+-----------+
3. Limitations of Wildcard tables
3.1 Supports Native BigQuery Storage Only
· Wildcard tables do not support external tables (e.g., Google Sheets, Cloud Storage, Bigtable, etc.).
· Wildcard tables do not support views.
· If the wildcard expression matches a view, the query throws an error. For example, if your dataset contains both tables and views, and your query matches a view, it will fail.
3.2 Cached Results Are Not Supported
· Queries using wildcard tables do not use cached results.
· Each query fully scans the data, meaning you always get billed. Normally, if you run a query twice, BigQuery caches the results and does not charge you again. However, for wildcard tables, this does not happen.
3.3 DML Statements Cannot Target Wildcard Tables
· DML (Data Manipulation Language) statements like INSERT, UPDATE, DELETE, MERGE cannot be used with wildcard tables.
· You cannot modify multiple tables at once using a wildcard.
In summary,
· Wildcard tables in BigQuery provide a powerful way to query multiple tables dynamically without complex SQL statements. This is particularly useful for analyzing time-based data, logs, or partitioned datasets efficiently.
· _TABLE_SUFFIX is a pseudo-column that helps identify which table a row came from when using wildcard tables.
· Use _TABLE_SUFFIX to filter tables dynamically, instead of hardcoding table names.
· Use _TABLE_SUFFIX in GROUP BY or ORDER BY to analyze data across multiple tables efficiently.
· Wildcard tables combined with _TABLE_SUFFIX simplify querying sharded datasets, especially when working with time-based data (e.g., logs, events, metrics).
· Use Longer Prefixes for Better Performance: When querying wildcard tables, longer prefixes improve performance. Short prefixes result in scanning more tables, leading to higher query costs and slower execution.
No comments:
Post a Comment