Understanding Google BigQuery: Tables, Caching, Partitioning, and Clustering

8 min readDec 21, 2024

Google BigQuery is a fully managed, serverless, and highly scalable data warehouse that allows businesses to run SQL-like queries on massive datasets. One of the key advantages of BigQuery is its ability to handle large-scale data storage and analytics with ease. To make data processing more efficient and cost-effective, BigQuery offers several features, such as different types of tables, data caching, partitioning, and clustering. This article will explore these features in detail, helping you understand when and why you should use partitioned or clustered tables in BigQuery.

Comparing SGBD traditional with Google BigQuery Structure

When dealing with BigQuery, we don’t have Databases schema and table like a traditional Databases, we have something similar as image above.

Types of Tables in Google BigQuery

In Google BigQuery, tables serve as the core unit for storing data. There are different types of tables that users can work with:

Native Tables

Native tables are the most basic type of tables in BigQuery. They store data that is loaded or ingested from external sources.

These tables can be static (data does not change after initial loading) or dynamic (data is updated regularly).

Native tables are managed by Google BigQuery and come with the full range of query capabilities, including the ability to use SQL for data manipulation.

External Tables:

External tables allow BigQuery to query data stored outside of BigQuery, such as data in Google Cloud Storage or Google Drive.

Unlike native tables, external tables do not store data in BigQuery. Instead, BigQuery queries the external data directly.

This is useful for integrating data stored in different systems without duplicating or migrating it to BigQuery.

CREATE EXTERNAL TABLE northwind.sales (
  Region STRING,
  Quarter STRING,
  Total_Sales INT64
) OPTIONS (
    format = 'CSV',
    uris = 'gs://mybucket/sales.csv',
    skip_leading_rows = 1);

Views:

A view is a virtual table based on the result of a SQL query. It doesn’t store data itself but dynamically retrieves data when queried.

Encapsulates complex SQL logic for reuse.

Views are useful for simplifying queries, enforcing data security, and optimizing query workflows.

Caching in Google BigQuery

BigQuery caches query results to improve performance and reduce costs. When you run a query in BigQuery, the results are cached for 24 hours by default. If you run the same query again within this time frame, BigQuery will retrieve the results from the cache instead of reprocessing the entire query. This helps save on both processing time and costs.

Here’s how caching works in BigQuery:

Query Result Cache: If a query is executed with the exact same SQL and parameters as a previous query, and the data hasn’t changed, BigQuery will use the cached results.
External Data Cache: If an external table (e.g., from Google Cloud Storage) is queried, the data is not cached by BigQuery, and each query will read the data again.
Non-Cached Queries: If the query involves a change in the SQL, new data is added, or if the cache has expired, BigQuery will re-execute the query.

The ability to cache query results is one of the key ways BigQuery can optimize performance. However, caching can also affect cost because queries that use cached results do not count toward query processing charges.

If the query uses non-deterministic functions; for example, date and time functions such as CURRENT_TIMESTAMP() and CURRENT_DATE, and other functions such as SESSION_USER(), it returns different values depending on when a query is executed so this aren’t cached!

Partitioned Tables in BigQuery

Partitioning in BigQuery is a powerful way to manage large datasets by dividing them into smaller, more manageable pieces, known as partitions. BigQuery supports partitioned tables based on date/timestamp columns or integer range columns.

Types of Partitioning in BigQuery

Time-based Partitioning

Partition by Date/Timestamp: The most common partitioning method in BigQuery, where a table is divided based on a date or timestamp column. For example, a table storing logs might be partitioned by the date the log entry was created.
When dealing with Partition by Timestamp, a new pseudocolumn will be add on table, something like: “_PARTITIONTIME”/ “_PARTITIONDATE”. Where _PARTITIONTIME is an UTC date that matches at value from _PARTITIONDATE.

select _PARTITIONTIME, * from northwind.order_details

Benefits: Partitioning by time improves query performance, as queries can focus on specific time periods and avoid scanning the entire dataset. This is particularly useful for time-series data.

Integer Range Partitioning

In this case, data is partitioned by an integer column, such as ID or some other range of numeric values.
Benefits: This is helpful when you have data that doesn’t naturally follow a time-based structure but still benefits from splitting into ranges to improve performance.

Why Use Partitioning?

Partitioning offers several key benefits:

Improved Query Performance: By reducing the amount of data scanned during queries, partitioning significantly speeds up query performance, especially for time-sensitive data. If a query filters on a specific time period, BigQuery will only scan the relevant partitions instead of the entire dataset.
Cost Efficiency: Since BigQuery charges based on the amount of data scanned, partitioning can reduce costs by minimizing the amount of data queried.
Efficient Data Management: Partitioning helps manage and archive data more effectively. For example, older partitions can be easily archived or deleted without affecting newer data.

However, partitioning can also have downsides. For instance, over-partitioning a table can lead to inefficient storage and additional complexity, especially if there are very few partitions or unevenly distributed data.

Let’s see some basic example

Integer Range Partitioning

CREATE TABLE northwind.orders (
  order_id INT64,
  customer_id STRING,
  employee_id INT64,
  order_date DATE
)
PARTITION BY RANGE_BUCKET(order_id, GENERATE_ARRAY(1, 10000, 100));

Let’s supose we have a dataset called northwind and we’re creating a table called orders.

The table is partitioned using RANGE_BUCKET on the order_id column.
RANGE_BUCKET groups the order_id values into ranges defined by the GENERATE_ARRAY function.
GENERATE_ARRAY(1, 10000, 100) creates an array of numbers from 1 to 10,000 with a step size of 100: Example: [1, 101, 201, …, 9901]. Each bucket (partition) holds order_id values that fall within a specific range (e.g., 1–100, 101–200).
If we insert some rows that order_id is out of range, they’ll inserted on a partition called: __UNPARTITIONED__
If the value is NULL them inserted on: __NULL__

select * from northwind.INFORMATION_SCHEMA.PARTITIONS
where table_name = 'orders'

A good nice feature of partition are the option: Require_partition_filter = TRUE.
When this option os enable the query can’t be run without a partition field filter.

Clustered Tables in BigQuery

While partitioning divides data into chunks based on time or numeric range, clustering organizes data based on the values of specific columns. This allows similar data values to be stored together in a table, improving performance when querying on those columns.

How Clustering Works

BigQuery automatically organizes your clustered data based on the columns you choose. For example, if you cluster a table by the region and product_id columns, BigQuery stores rows with the same region and product_id values together on disk. This minimizes the number of data blocks that need to be scanned when queries filter by those columns.

Why Use Clustering?

Clustering can be especially useful for:

Improving Query Performance: Clustering can make queries faster by reducing the amount of data scanned, particularly when filtering or sorting on clustered columns. It’s most effective when the clustered columns are frequently queried or used in JOIN operations.
Optimizing Large Tables: Clustering is often used in conjunction with partitioning for large datasets, where partitioning helps filter by time or range, and clustering helps further optimize data access by specific values.

Unlike partitioning, which requires a specific column (such as a timestamp or integer), clustering is more flexible because it allows you to choose one or more columns to cluster by.

Using the same example before, our DDL command will be something like that:

CREATE TABLE northwind.orders (
  order_id INT64,
  customer_id STRING,
  employee_id INT64,
  order_date DATE
)
PARTITION BY RANGE_BUCKET(order_id, GENERATE_ARRAY(1, 10000, 100));
CLUSTER BY customer_id;

Partitions splitted by cluster customer_id

When Should You Use Partitioning and Clustering?

Partitioning:

Your dataset has a dimension;
You frequently run queries that filter on specific time periods or ranges, such as analyzing historical data.
You want to reduce the amount of data scanned for time-based queries, improving performance and reducing costs.

Clustering:

You need to optimize queries that filter or join on one or more columns that are not time-based.
You have large datasets with specific columns that are frequently queried, such as region, product, or customer ID.
You want to further optimize query performance by reducing the data BigQuery needs to scan when filtering by clustered columns.

Note: You can use partitioning and clustering together to take advantage of both techniques. For example, a table can be partitioned by date and clustered by region and product_id for efficient querying of specific time periods and business areas.

Conclusion

Google BigQuery offers powerful features for managing and querying large datasets, including different types of tables, caching mechanisms, and data partitioning and clustering options. Understanding how to effectively use partitioned and clustered tables is key to optimizing query performance and managing costs.

Partitioning is ideal for datasets with time-based or numeric data that need to be segmented for efficient querying.
Clustering helps optimize queries based on specific column values, making it useful for large datasets where certain columns are frequently queried or joined.

By carefully choosing the right strategy for partitioning and clustering, organizations can significantly improve the performance and cost-efficiency of their data analytics in BigQuery.

Understanding Google BigQuery: Tables, Caching, Partitioning, and Clustering

Comparing SGBD traditional with Google BigQuery Structure

Types of Tables in Google BigQuery

Native Tables

External Tables:

Views:

Caching in Google BigQuery

Partitioned Tables in BigQuery

Types of Partitioning in BigQuery

Time-based Partitioning

Integer Range Partitioning

Why Use Partitioning?

Let’s see some basic example

Integer Range Partitioning

Clustered Tables in BigQuery

How Clustering Works

Why Use Clustering?

When Should You Use Partitioning and Clustering?

Partitioning:

Clustering:

Conclusion

Written by Rafael Rampineli

No responses yet