Using window functions in incremental tables

TL;DR: Should I just create views on top of incremental tables? My gut says probably.

We have a number of incremental tables that we refer to as ‘daily timeseries’ tables. They reflect the state of a record at the end of each day.

dbt is great for materialising these tables, except in a few use cases where we need window functions that evaluate the whole table. Because we only load yesterday's and today's data in each incremental run, the window functions would only evaluate the values for those two days instead of the whole table.

We have started building views on top of the incremental models to calculate the window functions, which is working okay, but I wanted to see whether others have found a preferable way of dealing with this. I don’t really have an issue with the view method, except that it appears to make some queries against the final output a bit slower, even when the window function isn’t being used. (For the most part, we get around this by dynamically changing which model is queried in Looker using the _in_query liquid variable in LookML.)

Has anyone else found other ways of dealing with this? Thanks!

Heh. You’re diving into the deep end here.

Incremental models in dbt are wonderful and many times they just work, and make incremental builds very straightforward. There are two specific cases when they don’t work great, though: distinct and window functions. Both of these require more context than just recent data to appropriately calculate their values.

There are two different ways to deal with this, and each of them takes a bit more work than what would typically be involved in building a dbt model:

  1. Incremental model >> view or table.
    Put the stuff that can be incrementalized into an incremental model and then layer another model on top of it for distincts or window functions. Sounds like this is what you’re doing today. I’d recommend putting a table on top of it instead of a view so that you can limit the complexity of downstream queries…the extra storage is cheap, and the strategy you outline just adds complexity. (There’s a quick sketch of this right after the list.)
  2. Write custom logic to combine results
    It’s also possible to query the existing table and get the current value of, for example, a row number. Then, you can add that row number value to the incremental data to get the true value for incremental rows. This strategy allows you to incremental-ize the whole transformation, but can require meaningfully more code. When you’re transforming datasets of many terabytes, this type of strategy is very important because strategy #1 just isn’t really feasible. Use specific database features like sort key or date partitioning to get the current values without performing full table scans.
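
For #1, the downstream model is just a plain table that does the full-history window work on top of the incremental model. A minimal sketch (the model and column names here are made up, not anything from your project):

-- models/daily_timeseries_windowed.sql
-- Plain table built on top of the incremental model. It is rebuilt in full on
-- every run, so the window function always sees the whole history.
{{
  config(
    materialized = 'table'
  )
}}

select
    record_id,
    record_date,
    balance,
    -- evaluated over the entire table, not just the incremental slice
    sum(balance) over (
        partition by record_id
        order by record_date
        rows between unbounded preceding and current row
    ) as running_balance
from {{ ref('daily_timeseries_base') }}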

It’s possible to go deep down a rabbit hole with #2, creating intermediate tables storing all kinds of bookmark values that essentially just allow your incremental jobs to work. It’s all about a performance / code complexity tradeoff though. It sounds like your current approach is working out ok.
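
To make #2 a little more concrete, here’s the rough shape of the row-number version, written with dbt’s is_incremental() macro. This is a sketch only: stg_events and the column names are placeholders, the where-false branch is Postgres/Redshift-flavoured, and you’d want to lean on your sort keys / partitioning in the current_max query.

{{
  config(
    materialized = 'incremental',
    unique_key = 'event_id'
  )
}}

with new_events as (

    select *
    from {{ ref('stg_events') }}
    {% if is_incremental() %}
    -- only rows we haven't processed yet, otherwise the offset below double counts
    where event_ts > (select max(event_ts) from {{ this }})
    {% endif %}

),

current_max as (

    -- current high-water mark per entity from the already-built table;
    -- restrict this via sort keys / date partitioning where you can
    {% if is_incremental() %}
    select
        entity_id,
        max(event_row_number) as max_row_number
    from {{ this }}
    group by 1
    {% else %}
    -- first build: no existing table, so nothing to offset against
    select null::integer as entity_id, 0 as max_row_number
    where false
    {% endif %}

)

select
    new_events.event_id,
    new_events.entity_id,
    new_events.event_ts,
    -- row_number over just the new rows, shifted by what's already in the table
    coalesce(current_max.max_row_number, 0)
        + row_number() over (
              partition by new_events.entity_id
              order by new_events.event_ts
          ) as event_row_number
from new_events
left join current_max
    on new_events.entity_id = current_max.entity_id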


:+1: Thanks @tristan. That makes sense. Do you have any publicly available examples of #2? We will definitely stick with #1 for the moment, and take your suggestion about materialisation, but want to understand what that next step looks like.

I actually don’t think we do. The only times we’ve had to do that are on very specific client applications, and that code doesn’t lend itself well to being open sourced. Sorry about that!

We did this by creating an updated_key column to use for the incremental sql_where value, which I think is what Tristan is saying in #2. In our case, we are just using lead() to grab information from the record immediately after the current row:

-- Check if there is another bid after this, if not, then this is the winning bid
lead(bids.bid_id, 1) OVER (partition by bids.item_id order by bids.bid_id) IS NULL AS is_winning_bid,

In this example, is_winning_bid is dependent on whether another bid exists after this one for the item. With an incremental model only built from bid_id or updated_at, this would fail to update if a new high bid is created after the table is built. So, we check for both bids in our sql_where key:

-- This field is used to let dbt know what records need to be updated since the model last ran. Usually,
-- we would just use an updated_at timestamp, but this model is also dependent on the bid placed after the current
-- bid, so we need to use the greatest of the bid_id from both bids.
greatest(bids.bid_id, lead(bids.bid_id, 1) OVER (partition by bids.item_id, bids.bidder_id order by bids.bid_id)) AS data_updated_key,

And then our incremental config:

{{
  config(
    materialized = 'incremental',
    sql_where = 'data_updated_key > (select max(data_updated_key) from {{ this }})',
    unique_key = 'bid_id'
  )
}}

So now our data_updated_key will be the newer of the two bids in our window function. This is a simple example, as we are only looking at one additional row using the lead() function. As Tristan mentioned, this could get considerably more complex if you have multiple window function columns you need to account for.


I used Tristan’s #2 approach in one of our web analytics data pipelines, for our daily incremental update. We actually went down the rabbit hole with #2 that @tristan talked about. I found that the additional bookmark table makes the whole logic easier to understand, maintain and debug, though it does add a bit of code complexity.
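
To give a generic picture of what I mean by a bookmark table (this isn’t our production code, and the names are invented): it’s just a small table, downstream of the big incremental model, that snapshots the per-visitor values the next incremental run needs.

-- models/session_bookmarks.sql
-- Rebuilt each run *after* the big incremental model, so at the start of the
-- next run it reflects state as of the end of the previous run.
{{ config(materialized = 'table') }}

select
    visitor_id,
    max(session_number) as last_session_number,
    max(session_end_ts) as last_session_end_ts
from {{ ref('web_sessions') }}
group by 1

One way to wire it up is to have the incremental model read this table by its schema-qualified name rather than through ref(), so dbt doesn’t see a circular dependency, and add last_session_number to the row_number() it computes over just the new events. The main extra care is around the first run, when the bookmark table doesn’t exist yet (an is_incremental() guard handles that).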


Interesting! I’ve just completed the dbt Fundamentals course (thanks for the awesome course!) and am trying to evaluate how it suits our needs.


There’s an ever-growing list of raw immutable events:

  • timestamp - date and time of an event

  • id - agent ID

  • is_online - true if an agent got online, false if went offline

Milestone no.1 is to calculate the number of unique agent IDs that were online (even for a moment) on a given day, i.e.:

  • day - date

  • num_online_ids - number of unique agent IDs that were online (even for a moment)

Milestone no.2 is to do the same on a rolling 28-day basis:

  • day - date

  • num_online_ids_28d - number of unique agent IDs that were online in the previous 28 days

Incremental models in dbt are wonderful and many times they just work, and make incremental builds very straightforward. There are two specific cases when they don’t work great, though: distinct and window functions. Both of these require more context than just recent data to appropriately calculate their values.

Sounds like it might be a tough nut to crack, because right away I can see distinct and window functions coming into play.

I was thinking maybe I could aggregate the periods when agents were online into non-overlapping timestamp ranges:

  • id - agent ID

  • start_ts - date and time when an agent got online

  • end_ts - date and time when an agent went offline

Then explode those into ID-day pairs, leveraging something like dbt_utils.date_spine:

  • id - agent ID

  • day - date when an agent was online

But I realize the SQL will get very hairy pretty quickly.
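
For what it’s worth, this is roughly the SQL I’m imagining for milestone no.1 (untested, Postgres-style date arithmetic, and stg_agent_events plus the spine start date are made-up placeholders):

with events as (

    select
        id,
        raw_events.timestamp as event_ts,
        is_online
    from {{ ref('stg_agent_events') }} as raw_events

),

online_periods as (

    -- pair each "went online" event with the next event for that agent;
    -- assumes events strictly alternate online/offline per agent
    select
        id,
        event_ts as start_ts,
        next_event_ts as end_ts
    from (
        select
            id,
            event_ts,
            is_online,
            lead(event_ts) over (partition by id order by event_ts) as next_event_ts
        from events
    ) ordered_events
    where is_online

),

days as (

    -- one row per calendar day
    {{ dbt_utils.date_spine(
        datepart = "day",
        start_date = "cast('2020-01-01' as date)",
        end_date = "current_date + 1"
    ) }}

),

agent_days as (

    -- explode the periods into (agent, day) pairs: an agent counts as online on
    -- a day if any part of an online period overlaps that day; open-ended
    -- periods (agent still online) run to now
    select distinct
        online_periods.id,
        days.date_day as day
    from online_periods
    join days
        on days.date_day < coalesce(online_periods.end_ts, current_timestamp)
        and days.date_day + 1 > online_periods.start_ts

)

select
    day,
    count(distinct id) as num_online_ids
from agent_days
group by 1

Milestone no.2 would then be another pass over agent_days, joining each spine day to the agent-days in the trailing 28 days and counting distinct ids. So it looks doable, but it’s exactly the kind of full-history distinct work that doesn’t slot neatly into an incremental model.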

Have there been any new developments/considerations/breakthroughs in this area in the past couple of years, @tristan?

Being completely new to dbt, I’m afraid of missing something obvious or already implemented/packaged.

Thanks in advance!!