BigQuery is a powerful, serverless data warehouse from Google Cloud that enables users to run fast SQL queries on large datasets. When working with BigQuery, organizing data efficiently is crucial. One of the fundamental building blocks of BigQuery is a Dataset. Understanding what a dataset is, how it works, and how to use it effectively is essential for anyone new to BigQuery.
1. What is a Dataset in BigQuery?
A dataset in BigQuery is a logical grouping or container that holds tables, views, and models within a Google Cloud project. Think of it as a folder that organizes your data logically, making it easier to manage access and storage.
A dataset provides the following benefits:
· Organization: It helps in structuring data into logical groups.
· Access Control: Permissions can be applied at the dataset level.
· Geographical Control: You can define the location of your data for compliance and performance optimization.
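For instance, every table reference in BigQuery passes through a dataset. In the sketch below, my-project, my_dataset, and my_table are placeholder names used only to illustrate the project.dataset.table hierarchy.

-- The dataset sits between the project and the table in a fully qualified name
SELECT * FROM `my-project.my_dataset.my_table`;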
2. How to create a Dataset?
Step 1: Log in to Google Cloud Console
Open a web browser and navigate to the Google Cloud Console (https://console.cloud.google.com).
Enter your credentials to sign in to your Google Cloud account. Ensure you have the necessary permissions to create a dataset in BigQuery.
Step 2: Access BigQuery Studio
Once logged in, locate the Navigation Menu (☰) in the upper-left corner of the Cloud Console. Click the menu to expand the options and scroll down to find the BigQuery section.
Click on BigQuery -> Studio to open the BigQuery workspace.
Step 3: In the Explorer panel on the left side of the BigQuery interface, locate your Google Cloud project. Right-click on the project name, then select Create dataset from the context menu.
The Create dataset popup appears.
In the Create dataset form, you need to fill in the options below.
a. Dataset ID: Provide a meaningful Dataset ID (a unique name for your dataset within the project). For this example, I used the name test_dataset.
b. Location Type: When creating a dataset in BigQuery, selecting the appropriate Location Type is crucial, as it determines where your data is stored and processed. This decision cannot be changed later, so it’s essential to choose wisely based on your use case.
BigQuery provides two types of data and compute locations:
· Region (Single Region): A specific geographic location, such as London or Mumbai.
· Multi-Region: A multi-region is a broader storage option where your dataset is replicated across multiple regions within a designated geographic area (e.g., US, EU, or Asia). This allows BigQuery to balance workloads across multiple data centers for better resilience and availability.
Regardless of the location type you choose, BigQuery automatically stores copies of your data in two different Google Cloud zones within the selected region. This helps ensure data availability and durability, even in the event of failures.
Benefits of Choosing a Single Region
· Lower Latency for Local Users: Data is physically closer to applications and users in that region, leading to faster access and query performance.
· Cost-Effective: Regional storage is often cheaper compared to multi-region storage.
· Regulatory Compliance: If your organization is subject to data residency laws (e.g., GDPR in the EU or CCPA in California), choosing a specific region ensures that data remains within legal boundaries.
When to Choose a Single Region
· If most of your users and applications operate in a specific geographic area.
· If your business requires data locality due to compliance regulations.
· If cost optimization is a priority, as regional storage is cheaper than multi-region storage.
Benefits of Choosing a Multi-Region
· Higher Availability & Redundancy: Data is automatically replicated across multiple locations, ensuring greater resilience and better disaster recovery.
· Larger Quotas: Multi-region locations often provide higher quotas than single-region locations, making them ideal for large-scale data analytics and high-volume workloads.
· Optimized for Distributed Users: If your data is accessed globally, multi-region storage helps balance the load, ensuring faster access across different locations.
When to Choose a Multi-Region
· If you have a global user base that needs fast and consistent data access from different locations.
· If high availability and disaster recovery are a priority.
· If you’re working with big data processing, AI/ML models, or distributed workloads that require higher quotas.
For this example, I selected Region and chose Mumbai.
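If you prefer SQL over the console, the same choice can be expressed with a minimal CREATE SCHEMA sketch. Here asia-south1 is the region ID for Mumbai, and test_dataset matches the Dataset ID chosen above; treat this as an illustration rather than a replacement for the console flow.

-- Create the dataset in the Mumbai region (asia-south1)
CREATE SCHEMA IF NOT EXISTS test_dataset
OPTIONS (location = 'asia-south1');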
c. External Dataset: When creating a dataset in BigQuery, you have the option to mark it as an External Dataset. This setting is particularly useful when you want to query and analyze data stored outside of BigQuery without physically importing it into BigQuery storage.
An External Dataset allows you to connect BigQuery to external data sources like Cloud Storage, Google Drive, or other databases. The resulting dataset in BigQuery is read-only and contains tables that are linked to the external source.
This means that instead of duplicating data by importing it into BigQuery storage, you can directly query the external data on demand, reducing storage costs and keeping the source data intact.
For this demo, let’s not select this option.
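As a rough illustration of querying data in place, the sketch below defines an external table over files in Cloud Storage, a closely related mechanism to the external dataset checkbox. The column list, bucket, and path are hypothetical.

-- Read CSV files directly from Cloud Storage; nothing is copied into BigQuery storage
CREATE EXTERNAL TABLE test_dataset.external_sales (
  order_id INT64,
  amount NUMERIC
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/sales/*.csv']
);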
d. Tags: When creating a dataset in BigQuery, you have the option to add tags. Tags play a crucial role in managing, organizing, and enforcing policies across your Google Cloud resources.
Tags are labels that help you categorize and manage your resources efficiently.
Each tag consists of:
· Tag Key – A unique identifier for the tag (e.g., environment).
· Tag Value – A value assigned to the key (e.g., production).
For example, if you want to label a dataset based on its purpose, you can create a tag with:
· Key: environment
· Value: development
This makes it easier to track, filter, and apply policies to datasets based on their assigned tags.
Let’s ignore this option for this demo.
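The tags in this form are Resource Manager tags and are managed through IAM tooling rather than DDL. As a loosely related sketch, dataset labels, which are also key-value pairs, can be attached like this; the environment/development pair is just an example.

-- Attach a key-value label to the dataset for filtering and cost tracking
ALTER SCHEMA test_dataset
SET OPTIONS (labels = [('environment', 'development')]);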
e. Advanced Options
e.1 Encryption: When creating a dataset in BigQuery, you have the option to configure encryption settings under Advanced Options. Encryption ensures that your data is secure both at rest and in transit.
BigQuery offers two types of encryption mechanisms:
Google-managed encryption keys (Default): Keys are owned and managed by Google. Requires no additional setup from the user. Provides built-in security, redundancy, and automatic rotation.
When to Choose Google-Managed Encryption
· You want a simple, hassle-free encryption mechanism.
· You trust Google to handle key management.
· Your organization doesn’t have specific compliance or regulatory needs requiring full key control.
· You want Google to automatically manage key rotation and security policies.
Limitations of Google Managed Encryption
· You cannot control or access the encryption keys.
· You cannot revoke access to the keys if needed.
· If compliance requires customer-controlled encryption, this option may not be sufficient.
Customer-managed encryption keys (CMEK) using Cloud Key Management Service (KMS): Keys are owned and controlled by you. Requires setup in Google Cloud KMS. Gives you full control over key rotation, access policies, and revocation.
When to Choose Cloud KMS Encryption (CMEK)
· You need full control over your encryption keys.
· Your organization has compliance requirements (e.g., GDPR, HIPAA, PCI-DSS) that require key ownership.
· You want to manage key rotation, expiration, and revocation manually.
· You need to restrict or revoke access to encrypted datasets for security purposes.
Limitations
· Requires additional configuration and management effort.
· If the key is deleted or becomes inaccessible, your data is permanently lost.
· Costs more than Google-managed encryption, as Cloud KMS incurs additional charges.
For this demo, let’s go with the Google-managed encryption key option.
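For reference, if CMEK were chosen instead, a dataset-level default key can be supplied at creation time. The sketch below assumes a hypothetical dataset name and key path.

-- Use a customer-managed Cloud KMS key as the dataset default (placeholder key path)
CREATE SCHEMA secure_dataset
OPTIONS (
  default_kms_key_name = 'projects/my-project/locations/asia-south1/keyRings/my-ring/cryptoKeys/my-key'
);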
e.2 Case Insensitive option
By default, BigQuery is case-sensitive when handling table and dataset names. This means:
· "SalesData" and "salesdata" are treated as two different tables.
· "Customer_Info" and "customer_info" are considered distinct entities.
If you check the "Case Insensitive" option while creating a dataset:
· BigQuery will ignore case differences when referencing tables.
· You can refer to the same table using uppercase or lowercase letters, and BigQuery will treat them as the same.
For example, if you have a table named "EmployeeData", you can query it as:
SELECT * FROM my_dataset.EMPLOYEEDATA;
SELECT * FROM my_dataset.employeedata;
SELECT * FROM my_dataset.EmployeeData;
Let’s enable it for the demo.
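In DDL form this checkbox corresponds to the is_case_insensitive dataset option; a minimal sketch, assuming the same test_dataset name:

-- Table and dataset names are matched without regard to case
CREATE SCHEMA IF NOT EXISTS test_dataset
OPTIONS (is_case_insensitive = TRUE);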
e.3 Default Collation
Collation defines the rules for sorting and comparing string values in BigQuery. It primarily affects:
· Case sensitivity: Whether "Hello" and "hello" are treated as the same.
· Sorting order: Whether uppercase letters appear before lowercase letters.
· Filtering and searching behavior: Whether queries differentiate between "Apple" and "apple".
By default, BigQuery is case-sensitive, meaning
SELECT * FROM my_table WHERE name = 'Alice';
will not return a row where name is stored as "alice" unless collation is case-insensitive.
You can choose one of the options below.
[Default] (Case-Sensitive Collation):
· BigQuery’s default behavior (case-sensitive).
· "DataScience" and "datascience" are treated as different values.
· Useful for strict comparisons (e.g., usernames, product codes).
· Best suited for datasets that require exact matches.
und:ci (Case-Insensitive Collation)
· "DataScience", "datascience", and "DATASCIENCE" are considered the same.
· Makes search and filtering easier without worrying about capitalization.
· Ideal for user-generated content, names, emails, or any text that should be matched regardless of case.
For this demo, let’s go with Default Collation.
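If you later need case-insensitive comparisons, the dataset-wide default can be declared with a DEFAULT COLLATE clause; this sketch uses a hypothetical dataset name.

-- New string columns in this dataset default to case-insensitive collation
CREATE SCHEMA reviews_dataset
DEFAULT COLLATE 'und:ci';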
e.4 Default table expiration
If enabled, any new table created in this dataset will be automatically deleted after the specified number of days. It helps in managing storage costs by removing tables that are no longer needed.
Useful for temporary or staging tables that store short-lived data.
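As a sketch, this setting maps to the default_table_expiration_days option; the 7-day value below is an arbitrary example.

-- Tables created in this dataset are deleted automatically after 7 days
ALTER SCHEMA test_dataset
SET OPTIONS (default_table_expiration_days = 7);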
e.5 Default Rounding Mode
If enabled, this rounding mode will be applied to all numeric values stored in tables within this dataset. It helps to maintain consistency in rounding behavior across multiple tables.
The supported rounding modes are described below.

ROUNDING_MODE_UNSPECIFIED (Default)
No specific rounding mode is enforced. BigQuery uses its default behavior, which might vary depending on the function or query used.
Use when:
· You are unsure about which rounding mode to apply.
· You prefer to handle rounding explicitly within queries rather than enforcing a dataset-wide rule.

ROUND_HALF_AWAY_FROM_ZERO
Also called "Standard Rounding" or "Arithmetic Rounding." If a number is exactly halfway between two integers, it rounds away from zero. Examples: 1.5 → 2, -2.5 → -3.
Use when:
· You want consistent rounding behavior that follows the general expectation for financial and numerical data.
· Your application requires the traditional rounding rules used in most programming languages.

ROUND_HALF_EVEN (Bankers’ Rounding)
If a number is exactly halfway between two integers, it rounds to the nearest even number. Also known as "Bankers’ Rounding" because it helps reduce bias in large datasets. Examples: 1.5 → 2, 2.5 → 2, 3.5 → 4, 4.5 → 4 (always toward the even neighbor).
Use when:
· You are working with large datasets, statistical calculations, or financial applications where rounding bias must be minimized.
· You follow GAAP (Generally Accepted Accounting Principles) accounting rules.
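In DDL, the dataset-wide setting corresponds to the default_rounding_mode option; ROUND_HALF_EVEN is used here purely as an example.

-- Apply Bankers' Rounding to NUMERIC/BIGNUMERIC values written into new tables
ALTER SCHEMA test_dataset
SET OPTIONS (default_rounding_mode = 'ROUND_HALF_EVEN');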
e.6 Storage Billing Model
BigQuery offers two ways to measure storage usage for billing purposes:
1. Logical Bytes (Default): Logical bytes represent the raw size of data before any storage optimizations (e.g., columnar storage, compression, or clustering). BigQuery calculates storage costs based on the original data size before compression. This means you pay based on how much data you load, not how efficiently it is stored.
For example, if you upload a 10 GB CSV file, even if BigQuery optimizes it to 2 GB through compression, you are still billed for 10 GB.
2. Physical Bytes: Measures the actual storage space consumed, including optimizations like compression. BigQuery bills you based on the space actually used on disk. Since BigQuery compresses data efficiently, this can significantly reduce your storage costs.
For example, if you upload 10 GB of JSON data and BigQuery compresses it to 2 GB, you are only billed for 2 GB under the physical bytes model.
If you use Physical Bytes, your storage costs may be lower, but query costs might be harder to estimate because queries process data at the Logical Byte level.
Logical Bytes is the default for a reason, as it’s predictable and aligns with query costs.
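Switching an existing dataset to physical-byte billing takes a single option change; this is a sketch against the demo dataset, and the choice applies to storage billed from that point on.

-- Bill storage by compressed (physical) bytes instead of logical bytes
ALTER SCHEMA test_dataset
SET OPTIONS (storage_billing_model = 'PHYSICAL');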
e.7 Time Travel Window
The Time Travel Window in BigQuery allows you to access previous versions of your data for a specified number of days. This is useful for recovering accidentally deleted or modified data.
Available Options
· Minimum: 2 days
· Maximum: 7 days
· If left unspecified, the default is 7 days (or an organization-defined default).
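Under the hood the window is expressed in hours (multiples of 24) via the max_time_travel_hours option, and historical data is read with FOR SYSTEM_TIME AS OF. The 48-hour value and the orders table below are examples only.

-- Keep 2 days (48 hours) of time travel history for this dataset
ALTER SCHEMA test_dataset
SET OPTIONS (max_time_travel_hours = 48);

-- Query the table as it looked one day ago
SELECT * FROM test_dataset.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);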
After filling in the options, click the CREATE DATASET button.
Upon successful creation, the new dataset test_dataset appears in the Explorer panel.
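You can also confirm the dataset from SQL by listing the datasets in the region through INFORMATION_SCHEMA; the region qualifier below assumes the Mumbai location chosen earlier.

-- List datasets in the asia-south1 (Mumbai) region of the current project
SELECT schema_name, creation_time
FROM `region-asia-south1`.INFORMATION_SCHEMA.SCHEMATA;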
That’s it, Happy Learning…..