Large incremental models

Hi all.

Recently, we spent some effort refactoring our models to make them more incremental, in the hope of reducing our warehousing bill. It worked for a few large tables (with some correctness caveats mentioned here). Looking at the query execution flow, we found that for most of the incremental models the ETL time is actually very short, while the merge time is long. We assume this is because the large base table has to be joined with the incremental temp table for the merge, leading to a full table scan of the base table.

An ad-hoc solution would be to reduce the size of the base table by partitioning it in a semi-meaningful manner, e.g. click_events_2019, click_events_2020, click_events_2021, then during the incremental run only merge into the latest table (2021) and expose a union of those tables for analytics. However, a full refresh/backfill gets tricky this way. I was wondering how anyone else is solving this problem? Is there any way to batch things easily without turning them into an operational problem?
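For illustration, the analytics-facing model I have in mind would just be a union over the yearly tables (a sketch only; the model names are hypothetical):

-- union the yearly partitions into one analytics-facing view
select * from {{ ref('click_events_2019') }}
union all
select * from {{ ref('click_events_2020') }}
union all
select * from {{ ref('click_events_2021') }}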


Hi ablimit,
I think you have two major directions you can work on:

  1. removing the merge and doing only inserts, keeping all the history of changes and getting the “latest” on the way out of the table, if possible, as inserts are way faster than updates;
  2. limiting what gets in, with a where clause on the incoming tables; in many DBs this will also benefit the target table, if you partition it on the same field you use for filtering.

The first one is the typical play from Data Vault: you just insert new stuff (for your definition of new) and keep one or two ordered timelines to get your data out in whatever order you are interested in (the first being when you got the info into the table -load timestamp-, a typical second being when it happened in real life -effectivity-). This removes the “out of order” problem in the load, as you simply solve it at query time according to your needs.
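A minimal sketch of the “latest on the way out” query, assuming an insert-only history table with a load_timestamp column and an event_id business key (names are placeholders), on an engine that supports qualify (e.g. Snowflake or BigQuery); elsewhere, wrap the row_number() in a subquery:

-- pick the most recent version of each key at query time
select *
from event_history
qualify row_number() over (
    partition by event_id            -- business key
    order by load_timestamp desc     -- or the effectivity timestamp, depending on your need
) = 1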

The second is just enforcing some acceptable limitation on what you are going to transform.
Even there, you could keep the two timelines (received vs. happened) to help you narrow down what to pick up on the incoming side of the transformation.

If you can use both approaches then you can, for example, limit both the incoming tables and the destination table to specific field values (for example year = 2021) and just insert the rows that are new or changed.
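As a rough sketch of the filtering side in a dbt incremental model (the staging model name, year filter and timestamp column are assumptions, not anything prescribed):

{{ config(materialized='incremental') }}   -- no unique_key: append-only inserts, as in point 1

select *
from {{ ref('stg_click_events') }}
where event_year = 2021                    -- limit the incoming side
{% if is_incremental() %}
  -- only pick up rows received since the last run
  and load_timestamp > (select max(load_timestamp) from {{ this }})
{% endif %}

If the target table is partitioned on the same field (event_year here), the engine can also prune partitions on the destination side.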

If even after filtering you have huge numbers, and your DB engine does not make good use of filters on the join keys, then you can try a “brute force” approach that uses what you know about the structure of the join key, even if I'd expect modern optimisers to handle this well when you have recent statistics (and good partitioning set up where you need it).

The idea is to exploit any low-cardinality field that is part of your merge key and break the huge merge down into smaller ones, one for each value of the key.

With simple scripting in dbt (a loop over the values the key has, at the granularity you like) you can break down one big merge (say, by year) into many smaller merges (by month or even by day), and then union them back to be inserted.
I'd expect your DB engine to be smart enough to do this with low-cardinality fields, and also to prune partitions on date fields, but you can always try different granularities (month, week, day) and compare the performance.

-- a minimal Jinja/SQL sketch: build one CTE per key value, then union them back
-- (t1, t2, f and the key_values list are placeholders for your tables, join field and its values)
{% set key_values = [2019, 2020, 2021] %}

with
{% for key in key_values %}
cte_{{ key }} as (
    select t1.*
    from t1
    join t2 on t1.id = t2.id    -- join t1, t2, ... as you need
    where t1.f = {{ key }}
      and t2.f = {{ key }}
){% if not loop.last %},{% endif %}
{% endfor %}

-- join them all back together
{% for key in key_values %}
select * from cte_{{ key }}
{% if not loop.last %}union all{% endif %}
{% endfor %}

Hi there ablimit,
We get a list of all the distinct values of the partition column that are part of the current incremental run, and use this list in the MERGE statement. However, to do this we needed to change the MERGE macro. The resulting join condition looks like this (PROCESS_DATE is the partition column):

..     ) as DBT_INTERNAL_SOURCE
    on ( DBT_INTERNAL_DEST.PROCESS_DATE = "2020-10-17" or DBT_INTERNAL_DEST.PROCESS_DATE = "2020-10-18" ) and 
        DBT_INTERNAL_SOURCE.EVENT_GUID = DBT_INTERNAL_DEST.EVENT_GUID
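For context, the predicate list can be built inside the overridden merge macro with something like the sketch below (the temp relation variable and macro internals differ by dbt version and adapter, so treat this as an illustration rather than the exact macro we run):

{% set dates = run_query("select distinct PROCESS_DATE from " ~ tmp_relation) %}
{% set predicates = [] %}
{% for row in dates.rows %}
  {% do predicates.append("DBT_INTERNAL_DEST.PROCESS_DATE = '" ~ row[0] ~ "'") %}
{% endfor %}
...
on ( {{ predicates | join(' or ') }} ) and
    DBT_INTERNAL_SOURCE.EVENT_GUID = DBT_INTERNAL_DEST.EVENT_GUID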

Hope this helps. If you need more detail, please let me know.
