BigQuery provides an auto schema detection feature that simplifies loading data by automatically inferring the schema from CSV and JSON files. This feature helps users to quickly import data without manually defining column types.
How Does BigQuery Auto-Detect the Schema?
BigQuery detects the schema by selecting a random file in the data source and scanning up to the first 500 rows as a representative sample. It examines each field and assigns a data type based on the values in that sample. BigQuery might also modify field names to make them compatible with GoogleSQL syntax.
Supported File Formats
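· Auto schema detection applies to CSV and newline-delimited JSON files. Self-describing formats such as Avro, Parquet, and ORC carry their own schema and do not need it.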
· It supports GZip-compatible compression when opening files.
How Schema Detection Works
· Delimiters: Detects common delimiters such as comma (,), pipe (|), and tab (\t).
· Header Detection: If the first row contains only strings while other rows contain other data types, BigQuery assumes it is a header row, uses it for the column names, and does not load it as data.
· Date Format: Dates must follow the YYYY-MM-DD format. Otherwise, the column is treated as a string.
· Timestamp Formats: Recognizes multiple timestamp formats like:
o YYYY-MM-DD HH:MM:SS
o YYYY-MM-DD HH:MM:SS.sss
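The rules above can be mimicked in a few lines of Python. This is only a toy illustration and not BigQuery's actual detection logic: the regular expressions, the function names (guess_value_type, guess_column_type, looks_like_header), and the choice to fall back to STRING for mixed columns are all assumptions made for the example.

```python
import re

# Hypothetical patterns that approximate the rules described above.
INTEGER_RE = re.compile(r"^[+-]?\d+$")
FLOAT_RE = re.compile(r"^[+-]?\d+\.\d+$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # YYYY-MM-DD only
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d+)?$")


def guess_value_type(value: str) -> str:
    """Guess a type name for a single cell."""
    if INTEGER_RE.match(value):
        return "INTEGER"
    if FLOAT_RE.match(value):
        return "FLOAT"
    if DATE_RE.match(value):
        return "DATE"
    if TIMESTAMP_RE.match(value):
        return "TIMESTAMP"
    return "STRING"


def guess_column_type(values: list[str]) -> str:
    """Guess a type for a column from its sampled values."""
    types = {guess_value_type(v) for v in values}
    if len(types) == 1:
        return types.pop()
    if types == {"INTEGER", "FLOAT"}:
        return "FLOAT"  # assumption: widen mixed numbers to FLOAT
    return "STRING"  # assumption: anything else mixed becomes STRING


def looks_like_header(rows: list[list[str]]) -> bool:
    """True if the first row is all strings and a later row is not."""
    if len(rows) < 2:
        return False
    first_all_strings = all(guess_value_type(c) == "STRING" for c in rows[0])
    others_typed = any(
        guess_value_type(c) != "STRING" for row in rows[1:] for c in row
    )
    return first_all_strings and others_typed
```

With this sketch, a column of values such as 2022-05-15 is guessed as DATE, while values such as 15/05/2022 fall through to STRING, which mirrors the date rule described in the list.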
Limitations of Auto Schema Detection
· Schema inference is based only on sample data and may not always be accurate.
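· If values in a column vary in format, the detected type may not fit every row: the column can fall back to a string, or rows beyond the sample can fail to load against the inferred type.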
· It does not support schema detection for file formats other than CSV and JSON.
Best Practices
· Manually define schemas when data consistency is crucial.
· Preprocess data to match expected formats before loading it.
· Verify inferred schemas before using the dataset in queries.
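As one way to preprocess data before loading, the sketch below rewrites day-first dates into the YYYY-MM-DD form that auto-detection recognizes as a date. The function name to_iso_date and the list of accepted input formats are assumptions for illustration; it also assumes every input is day-first, which you would need to confirm for your own data.

```python
from datetime import datetime

# Assumed input formats, all day-first: DD/MM/YYYY, DD-MM-YYYY, DD.MM.YYYY.
DAY_FIRST_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y")


def to_iso_date(value: str, formats=DAY_FIRST_FORMATS):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


# Illustrative values only.
for raw in ["15/05/2022", "20-08-2021", "11.10.2019"]:
    print(raw, "->", to_iso_date(raw))
```

Returning None for unparseable values, rather than guessing, makes it easy to find and review the rows that need manual attention before the load.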
Auto schema detection is a useful feature, but for production workloads, relying solely on it may lead to issues. Always validate your schema before loading large datasets into BigQuery.
Let's create a basic employee CSV file, upload it, and see how BigQuery detects the schema.
employees.csv
id,name,age,salary,hire_date,last_login,custom_date
1,Arjun Patel,30,55000.50,2022-05-15,2024-03-18 14:30:45,15/05/2022
2,Ananya Sharma,25,62000.75,2021-08-20,2024-03-18 09:15:30,20-08-2021
3,Ravi Kumar,40,75000.00,2019-11-10,2024-03-17 21:45:00,11.10.2019
4,Priya Reddy,35,68000.25,2020-07-05,2024-03-18 16:05:10,07/05/2020
Columns summary
· id: Integer
· name: String (Indian names)
· age: Integer
· salary: Float
· hire_date: Standard Date Format (YYYY-MM-DD)
· last_login: Timestamp Format (YYYY-MM-DD HH:MM:SS)
· custom_date: Non-Standard Date Format (DD/MM/YYYY, DD-MM-YYYY, DD.MM.YYYY)
Follow the step-by-step procedure below to upload the employees.csv file.
Step 1: Open BigQuery Explorer
· Open the Google Cloud Console.
· Navigate to BigQuery by selecting BigQuery from the menu or using the search bar.
· In the BigQuery Explorer, locate your project and dataset.
Step 2: Click on the Dataset
In the BigQuery Explorer, expand your project and click on the dataset where you want to upload the employees.csv file. You will be taken to the Dataset Information page.
Step 3: Click on 'CREATE TABLE'
On the dataset information page, click on the Create Table button to start the process of creating a new table.
Step 4: Fill in the Table Creation Form
Once you click Create Table, a form will open where you can specify details about the table you're creating.
Step 5: Choose 'Upload' as the Source
Under Source, select the Upload option.
Click on the Browse button and select your employees.csv file from your local system.
Step 6: Provide the Table Name
· In the Table Name field, enter the name of the table, for example: auto_detect_employees.
· This will be the name of the table in BigQuery that will hold the uploaded data.
Step 7: Auto-Detect Schema
In the Schema section, check the box next to Auto Detect. This allows BigQuery to automatically detect the schema based on the data in your CSV file.
BigQuery will infer the data types (e.g., string, integer, float, date, timestamp) by looking at the sample rows in the file.
Step 8: Click on 'Create Table'
After confirming all the details, click the Create Table button.
Step 9: Review the Table
After the table is created, you will be able to see it in your dataset. You can now run queries on the table and BigQuery will use the schema it auto-detected from the employees.csv file.
Since the custom_date column in employees.csv uses non-standard date formats (e.g., DD/MM/YYYY), BigQuery treats that column as a string rather than a date. Verify the schema after the upload to catch cases like this.
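The same upload and schema check can also be done programmatically. The following is a minimal sketch using the google-cloud-bigquery Python client, assuming the package is installed and Application Default Credentials are configured. The table ID is a placeholder, and the helper names load_csv_with_autodetect and unexpected_types are hypothetical, introduced only for this example.

```python
def load_csv_with_autodetect(csv_path: str, table_id: str) -> dict:
    """Load a local CSV with schema auto-detection; return {column: type}."""
    # Imported here so the helpers below can be used without the client
    # library installed; calling this function still requires it.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses Application Default Credentials
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,  # same effect as ticking Auto Detect in the console
    )
    with open(csv_path, "rb") as source_file:
        load_job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )
    load_job.result()  # wait for the job; raises if the load failed

    table = client.get_table(table_id)
    return {field.name: field.field_type for field in table.schema}


def unexpected_types(actual: dict, expected: dict) -> dict:
    """Return {column: (expected, actual)} for columns whose type differs."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Example usage (placeholder table ID, replace with your own):
# detected = load_csv_with_autodetect(
#     "employees.csv", "your-project.your_dataset.auto_detect_employees"
# )
# print(unexpected_types(detected, {"hire_date": "DATE", "custom_date": "DATE"}))
```

Comparing the detected types against the types you expect turns the manual schema review into a repeatable check; for the sample file, custom_date would be reported because it is detected as a string rather than a date.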
By following this procedure, you'll be able to upload the employees.csv file into BigQuery, and BigQuery will automatically infer the schema for your table based on the data in the file.