Background
dbt version 0.16.0 released a new strategy, `insert_overwrite`, for dbt’s incremental materialization on BigQuery. (There’s an overview here.) I’m interested in how changing the strategy, paired with other configuration options, impacts the performance of a dbt model’s incremental run.
Setup
I’m querying a table of Wikipedia page views in 2020. A given day contains ~150 million page view records, totaling 7 GB. Each record represents a given page being visited one or more times within a given date-hour, along with the number of views in that hour. The table is partitioned by `datehour`.
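For context, here’s roughly what the source data looks like. This is a sketch against BigQuery’s public Wikipedia page views dataset; the exact table path and column names (datehour, wiki, title, views) are my assumptions about the source, not copied from the benchmark code.

```sql
-- A quick look at one day-partition of the source table.
-- Table path and column names are assumptions; adjust to your project.
select
    datehour,   -- hour-level timestamp; also the partitioning column
    wiki,       -- language/project code, e.g. 'en' or 'de.m'
    title,      -- page title
    views       -- number of views of this page in this hour
from `bigquery-public-data.wikipedia.pageviews_2020`
where datehour >= timestamp('2020-01-01')
  and datehour <  timestamp('2020-01-02')
```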
Variables
Modeling
I am going to tag pages based on some arbitrary attributes and create three incremental models (a sketch of the shared incremental skeleton follows the list):
- Enriched: Returns the set of all pages visited, with additional columns describing the language and subject. The enriched table does some minor cleaning but also adds columns, so it stores ~130 million records and 9 GB per day-partition.
- Aggregated: Aggregates the pages by hour, language, and subject, returning only the total number of views. The summary table represents a significant reduction in dimensionality: one day-partition contains ~10k records and 500 KB.
- Goldilocks: A little bit of enrichment and a little bit of aggregation. One day-partition contains ~30 million records and 2.5 GB.
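All three models share the same incremental skeleton; only the select logic differs. Here’s a minimal sketch of what the aggregated model might look like, reusing the column names assumed above; the tagging expressions are illustrative, not the benchmark’s actual logic.

```sql
{{
    config(
        materialized = 'incremental',
        partition_by = {'field': 'datehour', 'data_type': 'timestamp'}
    )
}}

select
    datehour,
    -- illustrative "tagging": derive language and subject from the raw columns
    split(wiki, '.')[offset(0)]                                as language,
    case when title like 'Talk:%' then 'talk' else 'main' end  as subject,
    sum(views)                                                 as total_views

-- in a real project this would go through a dbt source() reference
from `bigquery-public-data.wikipedia.pageviews_2020`

{% if is_incremental() %}
  -- only scan partitions newer than the latest one already in the target table
  where datehour > (select max(datehour) from {{ this }})
{% endif %}

group by 1, 2, 3
```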
Incremental strategy
- `merge` (simple)
- `merge` with cluster key
- `insert_overwrite` with static partitions
- `insert_overwrite` (dynamic)
- Control: full refresh
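For reference, the strategies differ only in the model’s config block. All of these config keys are dbt-on-BigQuery options as of 0.16.0, but the cluster key and partition expressions shown are my own illustrative choices, not the benchmark’s exact configs.

```sql
-- Each config block below sits at the top of its own model file;
-- they're shown together here only for comparison.

-- 1. merge (simple): dbt's default incremental strategy on BigQuery
{{ config(
    materialized = 'incremental',
    partition_by = {'field': 'datehour', 'data_type': 'timestamp'}
) }}

-- 2. merge with cluster key: same, plus cluster_by (column choice is illustrative)
{{ config(
    materialized = 'incremental',
    partition_by = {'field': 'datehour', 'data_type': 'timestamp'},
    cluster_by = ['language']
) }}

-- 3. insert_overwrite with static partitions: name the partitions to replace up front
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'datehour', 'data_type': 'timestamp'},
    partitions = ['timestamp(current_date)']
) }}

-- 4. insert_overwrite (dynamic): dbt detects the affected partitions at run time
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'datehour', 'data_type': 'timestamp'}
) }}

-- Control: a full rebuild of the table via `dbt run --full-refresh`.
```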
Data volume
With X days of “old” data already in the table, I incrementally run Y days of data (a subset of X) and merge them back into the table. This should produce output identical to fully refreshing the table with X days of data. Each incremental run is labeled as X-Y days of data; a sketch of how the “new” window can be parameterized follows the list.
- 3 days old + 1 day new (3-1)
- 7 days old + 1 day new (7-1)
- 7 days old + 3 days new (7-3)
- 14 days old + 3 days new (14-3)
- 30 days old + 3 days new (30-3)
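For the static insert_overwrite runs, “Y days of new data” maps directly onto the partitions config. Here’s one way that list could be built for a 3-day run; the days_new variable and the date expressions are hypothetical conveniences, not necessarily what the benchmark used.

```sql
-- Hypothetical: build the list of day-partitions to overwrite for a
-- "3 days new" run, driven by a project variable (e.g. --vars '{days_new: 3}'),
-- then pass it to the static insert_overwrite config as `partitions = partitions_to_replace`.
{% set days_new = var('days_new', 3) %}
{% set partitions_to_replace = [] %}
{% for i in range(days_new) %}
    {% do partitions_to_replace.append(
        "timestamp(date_sub(current_date, interval " ~ i ~ " day))"
    ) %}
{% endfor %}
```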
Criteria
When I talk about model performance on BigQuery, I’m talking about:
- Speed: How many seconds does it take to run our incremental DML? Is it significantly better (or worse!) than control (the full refresh build)?
- Cost: How many bytes are scanned by our incremental query/queries? Is it significantly cheaper than control?
- Scale: How do runtime and cost increase as I increase the overall volume of data (“old”)? The volume of “new” data?
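One way to collect both the runtime and the bytes numbers after a batch of runs is BigQuery’s INFORMATION_SCHEMA jobs view. A sketch, assuming the jobs ran in the US multi-region (swap the region qualifier for your dataset’s location):

```sql
-- Pull runtime and bytes scanned/billed for recent query jobs.
select
    job_id,
    query,
    timestamp_diff(end_time, start_time, second) as runtime_seconds,
    total_bytes_processed,
    total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
  and job_type = 'QUERY'
order by creation_time desc
```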
Results
Here is a spreadsheet with all observations.
Findings
- Whereas bytes processed is fixed for a given data input, model build times vary significantly. Interestingly, runtimes are relatively flat across data volumes (massively parallel processing, perhaps).
- Across the board, static insert_overwrite is cheapest and almost always fastest. The dynamic insert_overwrite strategy is always slowest.
- Among the merge strategies, the clustered merge is always cheaper than the simple merge, and the two are neck-and-neck for runtime.
- In the aggregated model, the target table is so small that a simple merge performs on par with the clustered merge and a static insert_overwrite. The additional queries run by the dynamic insert_overwrite are worth neither the added time nor the additional bytes billed.
- In the enriched model, because the target table is larger than the source table, most strategies are worse than control (the full-refresh build) in terms of both bytes and speed. At scale (30-3 days of data), the clustered merge and the dynamic insert_overwrite start outperforming the full-refresh build.
- In the goldilocks and enriched models, the merge strategies scale more linearly in GBs processed than either insert_overwrite strategy. Despite a higher baseline cost (the y-intercept, which I can infer from the 3-1 runs), the dynamic insert_overwrite starts outperforming the clustered merge on cost once incremental runs process ~100 GB.