Date partitioned tables in BigQuery

Hey guys,

I’m working on an implementation where we are looking into using date-partitioned tables, i.e. partitioning based on a date column.

Why are we doing it?
These are event tables with a bunch of data coming in on a daily basis.
We are already using an incremental materialization, which makes dbt runs straightforward. But the incremental model still scans a lot of GBs for each query.

If we switched to partitioned tables, we could save time and GBs on queries.

I looked into your docs describing the approach for ingestion time based partitions: https://docs.getdbt.com/docs/creating-date-partitioned-tables

But I didn’t find a proper way to use that approach for date-partitioned tables.

The problem is how to use [DBT__PARTITION_DATE].

With a partition based on a date column, you simply use
WHERE DATE(date_column) = '2020-02-02' to limit the query.

I’m trying to use [DBT__PARTITION_DATE] for this. Unfortunately, it comes in the wrong format (20200202).

I already tried some Jinja magic (datetime parsing and formatting), but it seems that the [DBT__PARTITION_DATE] replacement happens after the Jinja template is rendered, so I run into compile errors.

Even when I do PARSE_DATE('%Y%m%d', '[DBT__PARTITION_DATE]'), I don’t save any GBs when running the query.

Any thoughts what would work in that case?

Hi @timo, this is a great question! You mention two different approaches that dbt takes to partitioned tables in BigQuery, and I’ll explain why I prefer one of them.

Partition on column values

The first is what I refer to as a column-partitioned table. This includes tables partitioned on a date or timestamp column as well as tables partitioned on a range of values in an integer column, which is a relatively recent BQ feature. The way to encode column-based partitioning in dbt is by using the partition_by model config.
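
As a minimal sketch, the config looks something like this. The model, source, and column names here are made up, and the dictionary syntax shown is the one introduced in dbt 0.16.0 (earlier versions took a raw SQL string, e.g. partition_by='date(event_timestamp)'):

```sql
-- models/events.sql (hypothetical model)
{{
    config(
        materialized = 'incremental',
        partition_by = {'field': 'event_date', 'data_type': 'date'}
    )
}}

select
    event_id,
    event_date,
    payload
from {{ source('analytics', 'raw_events') }}  -- hypothetical source
```

With this in place, BigQuery prunes partitions automatically for any downstream query that filters on event_date.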

We are already using an incremental materialization, which makes dbt runs straightforward. But the incremental model still scans a lot of GBs for each query.

You’re right that dbt’s incremental materialization has not been cost-effective for partitioned tables on BigQuery. Even if you limit the amount of data scanned to generate the “new” records, using a where clause wrapped in the is_incremental() macro, dbt has been missing a critical filter in the merge step.
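
For reference, the scan-limiting filter on the “new records” side typically looks something like this (table and column names are hypothetical):

```sql
select *
from {{ source('analytics', 'raw_events') }}  -- hypothetical source

{% if is_incremental() %}
  -- only scan partitions newer than what's already in the target table
  where event_date >= (select max(event_date) from {{ this }})
{% endif %}
```

The gap was on the other side: the generated merge statement scanned the full destination table, rather than only the partitions it needed to update.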

The really good news is that dbt 0.16.0 (currently in prerelease) is going to be a lot smarter about partitioned incremental models in BigQuery. I’ll be writing a post about this very soon, but you can get a sneak preview by checking out #2140.

Partition during table creation

The second approach is what BigQuery calls ingestion-time partitioned tables, and what dbt documentation confusingly refers to as date-partitioned tables. The idea is that dbt will insert or update your table in batches, and you can use a pseudo-column (where _partitiontime = '2020-02-02') or table decorator (my_table$20200202) to limit the amount of data scanned in a query.
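
In standard SQL, the pseudo-column filter looks like this (a minimal sketch; the project, dataset, and table names are made up):

```sql
-- prune to a single day's partition via the ingestion-time pseudo-column
select *
from `my_project.my_dataset.my_events`
where _partitiontime = timestamp('2020-02-02')
```

Because _partitiontime is partition metadata rather than a real column, BigQuery reads only the one matching partition.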

Conceptually, this is similar to the dbt_utils.insert_by_period materialization, and in the same vein as some other batch-based “Big Data” technologies, where data is only ever inserted, updated, or deleted in atomic units of one day.

This second approach is actually older than the first: BigQuery released support for ingestion-time partitioning before it supported partitioning a table on its own date or timestamp column. This approach feels more old-school, too: it’s all about operating on metadata, rather than helping the database extract metadata from your data proper.

Conclusion

I think the release history is a reason why the feature naming can be quite confusing. In general, I use column-based partitioning on almost every table, and ingestion-time partitioning rarely, if ever. With the improvements coming in dbt 0.16.0, I don’t see that changing any time soon.