Using window functions in incremental tables

Heh. You’re diving into the deep end here.

Incremental models in dbt are wonderful; much of the time they just work and make incremental builds very straightforward. There are two specific cases where they don't work well, though: distinct aggregations and window functions. Both of these require more context than just the recent data to calculate their values correctly.

There are two different ways to deal with this, and each of them takes a bit more work than what would typically be involved in building a dbt model:

  1. Incremental model >> view or table.
    Put the stuff that can be incrementalized into an incremental model and then layer another model on top of it for distincts or window functions. Sounds like this is what you’re doing today. I’d recommend putting a table on top of it instead of a view so that you can limit the complexity of downstream queries…the extra storage is cheap and the strategy you outline just adds complexity.
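A minimal sketch of strategy #1, assuming a hypothetical `events` source with `event_id`, `user_id`, and `event_at` columns (all names here are illustrative, not from the original post): the incremental model only processes new rows, while the table layered on top recomputes the window function over the full history.

```sql
-- models/stg_events_incremental.sql  (hypothetical model)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_at
from {{ source('app', 'events') }}
{% if is_incremental() %}
  -- only scan rows newer than what's already in the table
  where event_at > (select max(event_at) from {{ this }})
{% endif %}

-- models/events_ranked.sql  (hypothetical model)
-- Materialized as a table, not a view, so downstream queries
-- don't re-run the window function every time.
{{ config(materialized='table') }}

select
    *,
    row_number() over (partition by user_id order by event_at) as event_seq
from {{ ref('stg_events_incremental') }}
```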
  2. Write custom logic to combine results
    It’s also possible to query the existing table and get the current value of, for example, a row number. Then, you can add that row number value to the incremental data to get the true value for incremental rows. This strategy allows you to incrementalize the whole transformation, but it can require meaningfully more code. When you’re transforming datasets of many terabytes, this type of strategy is very important because strategy #1 just isn’t really feasible. Use database features like sort keys or date partitioning to get the current values without performing full table scans.
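Here's one way strategy #2 might look for a running `row_number()`, again with hypothetical model and column names: on incremental runs, look up the current per-user maximum from the existing table and offset the fresh row numbers by it.

```sql
-- models/events_ranked_incremental.sql  (hypothetical model)
{{ config(materialized='incremental', unique_key='event_id') }}

{% if is_incremental() %}
with current_max as (
    -- current high-water mark per user; a sort key or date
    -- partition on {{ this }} keeps this from being a full scan
    select user_id, max(event_seq) as max_seq
    from {{ this }}
    group by user_id
),
new_events as (
    select event_id, user_id, event_at
    from {{ ref('stg_events') }}
    where event_at > (select max(event_at) from {{ this }})
)
select
    n.event_id,
    n.user_id,
    n.event_at,
    -- offset the new row numbers by the stored maximum
    coalesce(m.max_seq, 0)
      + row_number() over (partition by n.user_id order by n.event_at)
      as event_seq
from new_events n
left join current_max m using (user_id)
{% else %}
-- first run / full refresh: compute over everything
select
    event_id,
    user_id,
    event_at,
    row_number() over (partition by user_id order by event_at) as event_seq
from {{ ref('stg_events') }}
{% endif %}
```

Note this only stays correct if rows arrive in order within each partition; late-arriving data would need extra handling.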

It’s possible to go deep down a rabbit hole with #2, creating intermediate tables storing all kinds of bookmark values that essentially just allow your incremental jobs to work. It’s all about a performance / code complexity tradeoff though. It sounds like your current approach is working out ok.
