Monday, 19 May 2025

Efficiently Querying Multiple Tables with Wildcard Tables in BigQuery

In BigQuery, handling datasets with multiple partitioned or sharded tables can be cumbersome if each table needs to be queried individually. Wildcard tables allow querying multiple tables at once using a concise SQL statement. This feature is particularly useful when dealing with time-series data or logs that are stored in different tables following a consistent naming pattern.

1. What are Wildcard Tables?

Wildcard tables in BigQuery let you query multiple tables that share a common schema and a similar naming convention. Instead of writing multiple UNION ALL statements, you can use a wildcard expression to efficiently fetch data from all matching tables.

 

Schema Requirements

·      All matching tables should have the same schema or a compatible schema.

·      If schemas differ, queries might fail or require explicit column selection to avoid mismatches.

 

Example Use Case

Suppose you have daily log tables in the dataset my_project.logs_dataset, named as follows:

 

·      app_logs_20240301

·      app_logs_20240302

·      app_logs_20240303

 

Instead of querying each table individually, you can use a wildcard:

SELECT event_time, user_id, event_type
FROM `my_project.logs_dataset.app_logs_*`

Benefits of Wildcard Tables

·      Simplifies Queries: No need to manually list and UNION ALL multiple tables.

·      Efficient Execution: BigQuery processes only the required tables based on _TABLE_SUFFIX.

·      Scalability: Ideal for querying large datasets partitioned by time, improving query manageability.

 

Sample DDL and DML statements for you to experiment

-- Creating table for March 1st, 2024
CREATE TABLE `my_project.logs_dataset.app_logs_20240301` (
    event_time TIMESTAMP,
    user_id STRING,
    event_type STRING
);

-- Creating table for March 2nd, 2024
CREATE TABLE `my_project.logs_dataset.app_logs_20240302` (
    event_time TIMESTAMP,
    user_id STRING,
    event_type STRING
);

-- Creating table for March 3rd, 2024
CREATE TABLE `my_project.logs_dataset.app_logs_20240303` (
    event_time TIMESTAMP,
    user_id STRING,
    event_type STRING
);


-- Inserting data into app_logs_20240301
INSERT INTO `my_project.logs_dataset.app_logs_20240301` (event_time, user_id, event_type)
VALUES
    (TIMESTAMP '2024-03-01 10:00:00', 'user_001', 'login'),
    (TIMESTAMP '2024-03-01 10:05:00', 'user_002', 'purchase'),
    (TIMESTAMP '2024-03-01 10:10:00', 'user_003', 'logout');

-- Inserting data into app_logs_20240302
INSERT INTO `my_project.logs_dataset.app_logs_20240302` (event_time, user_id, event_type)
VALUES
    (TIMESTAMP '2024-03-02 09:00:00', 'user_004', 'login'),
    (TIMESTAMP '2024-03-02 09:15:00', 'user_001', 'purchase'),
    (TIMESTAMP '2024-03-02 09:30:00', 'user_002', 'logout');

-- Inserting data into app_logs_20240303
INSERT INTO `my_project.logs_dataset.app_logs_20240303` (event_time, user_id, event_type)
VALUES
    (TIMESTAMP '2024-03-03 08:30:00', 'user_005', 'login'),
    (TIMESTAMP '2024-03-03 08:45:00', 'user_002', 'purchase'),
    (TIMESTAMP '2024-03-03 09:00:00', 'user_004', 'logout');

 

2. Understanding _TABLE_SUFFIX in wildcard table names

_TABLE_SUFFIX is a special pseudo-column in BigQuery that represents the variable part of a wildcard table name. When you use a wildcard (*) in your query to match multiple tables, _TABLE_SUFFIX helps you filter and differentiate between them dynamically.

 

Querying Multiple Tables Using _TABLE_SUFFIX

Let's say you have the following tables in your dataset.

 

Table Name

Data Stored

app_logs_20240301

Logs for March 1, 2024

app_logs_20240302

Logs for March 2, 2024

app_logs_20240303

Logs for March 3, 2024

 

Basic Wildcard Query Without _TABLE_SUFFIX

If you run the following query:

 

SELECT event_time, user_id, event_type
FROM `my_project.logs_dataset.app_logs_*`

It will fetch all data from all tables that match app_logs_*. But, you won’t know which table each row came from.

 

+----------------------------+--------+-----------+
| event_time                 | user_id | event_type |
+----------------------------+--------+-----------+
| 2024-03-03 09:00:00.000000 | user_004 | logout    |
| 2024-03-03 08:45:00.000000 | user_002 | purchase  |
| 2024-03-03 08:30:00.000000 | user_005 | login     |
| 2024-03-01 10:10:00.000000 | user_003 | logout    |
| 2024-03-01 10:00:00.000000 | user_001 | login     |
| 2024-03-01 10:05:00.000000 | user_002 | purchase  |
| 2024-03-02 09:30:00.000000 | user_002 | logout    |
| 2024-03-02 09:00:00.000000 | user_004 | login     |
| 2024-03-02 09:15:00.000000 | user_001 | purchase  |
+----------------------------+--------+-----------+

Using _TABLE_SUFFIX to Identify the Source Table

To see which table each row belongs to, include _TABLE_SUFFIX in the SELECT statement:

SELECT event_time, user_id, event_type, _TABLE_SUFFIX
FROM `my_project.logs_dataset.app_logs_*`

Sample Output

_TABLE_SUFFIX extracts the date part (20240301, 20240302, 20240303) from each table name.

+----------------------------+--------+-----------+-----------+
| event_time                 | user_id | event_type | _TABLE_SUFFIX |
+----------------------------+--------+-----------+-----------+
| 2024-03-01 10:10:00.000000 | user_003 | logout    | 20240301  |
| 2024-03-01 10:00:00.000000 | user_001 | login     | 20240301  |
| 2024-03-01 10:05:00.000000 | user_002 | purchase  | 20240301  |
| 2024-03-02 09:30:00.000000 | user_002 | logout    | 20240302  |
| 2024-03-02 09:00:00.000000 | user_004 | login     | 20240302  |
| 2024-03-02 09:15:00.000000 | user_001 | purchase  | 20240302  |
| 2024-03-03 09:00:00.000000 | user_004 | logout    | 20240303  |
| 2024-03-03 08:45:00.000000 | user_002 | purchase  | 20240303  |
| 2024-03-03 08:30:00.000000 | user_005 | login     | 20240303  |
+----------------------------+--------+-----------+-----------+

Filtering Data for a Specific Date Range Using _TABLE_SUFFIX

Let's say you only want data from March 1st to March 2nd, 2024. Instead of specifying each table separately, use _TABLE_SUFFIX to filter:

SELECT event_time, user_id, event_type, _TABLE_SUFFIX
FROM `my_project.logs_dataset.app_logs_*`
WHERE _TABLE_SUFFIX IN UNNEST(['20240301', '20240302']);

+----------------------------+--------+-----------+-----------+
| event_time                 | user_id | event_type | _TABLE_SUFFIX |
+----------------------------+--------+-----------+-----------+
| 2024-03-01 10:10:00.000000 | user_003 | logout    | 20240301  |
| 2024-03-01 10:00:00.000000 | user_001 | login     | 20240301  |
| 2024-03-01 10:05:00.000000 | user_002 | purchase  | 20240301  |
| 2024-03-02 09:30:00.000000 | user_002 | logout    | 20240302  |
| 2024-03-02 09:00:00.000000 | user_004 | login     | 20240302  |
| 2024-03-02 09:15:00.000000 | user_001 | purchase  | 20240302  |
+----------------------------+--------+-----------+-----------+

 

3. Limitations of Wildcard tables

3.1 Supports Native BigQuery Storage Only

·      Wildcard tables do not support external tables (e.g., Google Sheets, Cloud Storage, Bigtable, etc.).

·      Wildcard tables do not support views.

·      If the wildcard expression matches a view, the query throws an error. For example, if your dataset contains both tables and views, and your query matches a view, it will fail.

 

3.2 Cached Results Are Not Supported

·      Queries using wildcard tables do not use cached results.

·      Each query fully scans the data, meaning you always get billed. Normally, if you run a query twice, BigQuery caches the results and does not charge you again. However, for wildcard tables, this does not happen.

 

3.3 DML Statements Cannot Target Wildcard Tables

·      DML (Data Manipulation Language) statements like INSERT, UPDATE, DELETE, MERGE cannot be used with wildcard tables.

·      You cannot modify multiple tables at once using a wildcard.

 

In summary,

·      Wildcard tables in BigQuery provide a powerful way to query multiple tables dynamically without complex SQL statements. This is particularly useful for analyzing time-based data, logs, or partitioned datasets efficiently.

·      _TABLE_SUFFIX is a pseudo-column that helps identify which table a row came from when using wildcard tables.

·      Use _TABLE_SUFFIX to filter tables dynamically, instead of hardcoding table names.

·      Use _TABLE_SUFFIX in GROUP BY or ORDER BY to analyze data across multiple tables efficiently.

·      Wildcard tables combined with _TABLE_SUFFIX simplify querying sharded datasets, especially when working with time-based data (e.g., logs, events, metrics).

·      Use Longer Prefixes for Better Performance: When querying wildcard tables, longer prefixes improve performance. Short prefixes result in scanning more tables, leading to higher query costs and slower execution.

 


Previous                                                    Next                                                    Home

No comments:

Post a Comment