dbt supports two types of date partitioning in BigQuery:
This post provides background knowledge for these two types of partitioning, as well as some other helpful context.
Partitions
In databases, “partitions” are subsets of data that can all be grouped together. One natural grouping of data is a date field, like a created_at
timestamp. Databases are able to store the data for these partitions separately, meaning that queries which filter on partitions can limit the amount of data that needs to be scanned. In turn, this helps reduce query costs and improve query speeds.
Partitioning in BigQuery
BigQuery has a few different features that all do something like “partitioning”, but they all work differently and have different semantics.
date sharding
This isn’t actually “partitioning” per se, but it’s a mechanism for limiting queries to specific days of data. You can create data sharded tables by creating multiple tables with a date suffix, eg:
dbt_dbanin.orders_20180101
dbt_dbanin.orders_20180102
dbt_dbanin.orders_20180103
...
To query for a specific day, you can query one table in particular:
select * from dbt_dbanin.orders_20180101
You can also use a wildcard character to query across a union of days, eg:
select * from dbt_dbanin.`orders_201801*`
This query will select data from a union of all of the tables for January 2018.
Ingestion-time partitioned tables
These were BigQuery’s first “real” attempt at partitioning. They’re called “ingestion time” partitions because they’re meant to be incrementally loaded into on a daily basis. These work a lot like the date sharded table example above, except there’s only one table with many partitions instead of one table per partition.
You can’t create these ingestion-time partitioned tables in SQL. Instead, you need to issue an API call to create an empty table, setting the table’s partitioning_type
attribute to DAY
.
When loading data into an ingestion-time partitioned table, you can specify a “table decorator” which signifies the day partition to insert data into. Again, this unfortunately can’t be done in SQL and instead requires the use of the BQ API.
When querying an ingestion-time partitioned table, you can use the _PARTITIONTIME
pseudocolumn to limit the partitions that are scanned by your query. That looks like:
select * from dbt_dbanin.orders
where _PARTITIONTIME = '2018-01-01'
This has the same effect as date-sharding, but is a little easier to work with because you can use real date ranges as filters instead of wildcard characters. You can find more information about creating ingestion-time partitioned tables with dbt in the docs.
Column-partitioned tables
Column partitioned tables work a lot more naturally than ingestion-time partitioned tables. When creating a table, you can designate a column to be the “partition column” in SQL. Once the table exists, you can filter on that partitioning column to limit the amount of raw data scanned by your query.
You can create a partitioned table with:
create table dbt_dbanin.orders
partition by order_date -- <--- this is the partitioning column
as (
select ...
)
This will create a table called dbt_dbanin.orders
with the order_date
specified as the partitioning column. When querying the orders table, any static filter against that column will limit the amount of raw data scanned. Eg:
select * from dbt_dbanin.orders
where order_date = '2018-01-01'
Filtering on a specific column is a lot more natural than using the _PARTITIONTIME
pseudocolumn, and is generally the approach that I would recommend using in all cases. You can find instructions for creating column-partitioned tables in the docs.
We implemented ingestion time partitioning in dbt before column partitioning existed. I think at this point, there’s very little reason to use ingestion time partitioning in a data modelling use case, and we will likely deprecate ingestion-time partitioning at some point in the future.