Incremental model with multiple slowly changing dimensions

Hello there,
I’m working on a model in dbt v 0.20 that I can’t seem to update with a successful incremental strategy. I’d love some ideas on whether this is an appropriate strategy, and if I’m thinking about it the right way.

It’s a denormalization of a sales order record with the customer ID with a household ID associated to that customer. All flattened into a single row. There are a few source applications that feed into a warehouse. I’m using the warehouse as my source for the dbt models. A view is not possible because the resulting table is quite large and often used, so our DBAs have asked that it be materialized. Nor can it be truncated/loaded with each model run.

The sales order can update; e.g., it may change from Pending to Fulfilled. And the source for this uses a type 2 versioning (version start & version end dates), in addition to a source update timestamp from the source system and an ETL timestamp as the source loads into the warehouse. We want the model to expose only the current, latest version.

The association of a sales order to a customer ID can update; seems weird, I know. But it can happen. The association of an order to a customer in our source is also type 2 versioning. We want the model to expose only the current, latest customer(s) associated to the order.

The association of a customer to a household can also update. Householding is algorithmically driven and happens outside the source for orders & customers. Our warehouse receives updated linkages between a customer and a household sporadically. Updates can also arrive out of logical order. That is, the warehouse may receive an update today that was generated by the householding process from a week ago. So newest records are not always current records. There is no versioning. But there is a timestamp associated with when the algorithm determined the association, and a timestamp of when the warehouse received the record from source. We want the model to expose only the current, latest household associated with the the latest customer(s) associated with the order…

As I’ve tried to model this out as an incremental model, I have trouble with two places:

  1. Because an update could happen to an order, or to the association of an order to a customer, or the association of a customer to a houshold, I have 3 conditions chained by ORs in my {% if is_incremental() %} ... {% endif %} block. That’s not a problem per se, but it feels computationally inefficient.
  2. Because the householding can be received out of order, and does not have good versioning on it, I don’t know how best process it. I feel like what I want is to take the newest records loaded into the warehouse, and only UPSERT them if they are the latest record from source. But that’s now how I understand dbt incremental models to work.

Any advice for this type of problem; or any advice on if I should be looking at it differently? Cheers!

  1. There’s nothing inherently wrong with multiple conditions, so I don’t think this is an issue.
  2. Add a column in the model that is something like household_loaded_at and then add a or condition to your is_incremental() block to get household rows with a newer loaded_at than household_loaded_at. If you use the sales order id for the unique_key config, dbt will do an upsert for you. And it’s best practice to put a test on the model to ensure the sales order id is in fact unique.

Much obliged @andrewwilson I think I got it working.
In short, I use a series of CTEs that find new records for each prior source table, combined into a driver CTE that can then be used to fetch anything that is incremental to the existing table. This means dragging along ETL AND system update timestamps from the parent models so that we can test for any changes to each.

In this example, I build a driver table of any new ETLd records for the sales order; and separate CTEs do the same for sales order to customer ID, and customer ID to household. The trick is the these CTEs need to return the same fields as what goes into the model’s unique (composite) key. E.g., the order ID-customer ID combo.
Then, combine the CTEs into a driver table and go build out the current record version of order-customer-household using the latest source system update timestamps. UPSERT these into the table, merged on the unique key.

Hope that helps someone else who might run into the same trouble.

1 Like