Hi ablimit,
I think you have two major directions you can work on:
- removing the merge and doing only inserts, keeping the full history of changes and getting the “latest” on the way out of the table, if possible, as inserts are way faster than updates;
- limiting what gets in, with a where clause on the incoming tables; in many DBs this will also benefit the target table, if you partition it on the same field you use for the filtering.
The first one is the typical play from Data Vault: you just insert new stuff (for your definition of new) and keep one or two ordered timelines to get your data out in whatever order you are interested in (the first being when you got the info into the table -load timestamp-, the typical second being when it happened in real life -effectivity-). This removes the “out of order” problem in the load, as you just solve it at query time according to your needs.
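As a rough sketch of what “getting the latest on the way out” could look like (orders_history, order_id, load_ts and effective_ts are just hypothetical names for an insert-only table, its business key and the two timelines):

-- pick one row per business key: the latest by effectivity, ties broken by load time
select *
from (
    select
        oh.*,
        row_number() over (
            partition by order_id
            order by effective_ts desc, load_ts desc
        ) as rn
    from orders_history as oh
) as ranked
where rn = 1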
The second is just enforcing some acceptable limit on what you are going to transform.
Even there you could use the two timelines (received vs. happened) to help you narrow down what to pick up on the incoming side of the transformation.
If you can use both approaches then you can, for example, limit both the incoming tables and the destination table to specific field values (for example year = 2021) and just insert the rows that are new or changed.
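In dbt terms, a minimal sketch of this combined approach could be an incremental model that filters both sides on the same field (stg_orders, order_id, order_year and loaded_at are placeholder names):

{{ config(materialized='incremental', unique_key='order_id') }}

select *
from {{ ref('stg_orders') }}
where order_year = 2021            -- limit the incoming side
{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what the target already has
  and loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}

If the target table is also partitioned (or clustered) on order_year, the engine only has to touch that slice when applying the merge.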
If even after filtering you have huge numbers and your DB engine does not make good use of the filters on the join keys, then you can try a “brute force” approach that uses what you know about the structure of the join key, even if I expect modern optimisers to handle this well when you have recent statistics (and good partitioning set up where you need it).
The idea is to exploit any low-cardinality field that is part of your merge key and break the huge merge down into smaller ones, one for each value of that field.
With simple scripting in dbt (a loop over the values the key has, at the granularity you like) you could break down one merge (say, by year) into many smaller merges (by month or even by single day), and then union them back together to be inserted.
I expect your DB engine to be smart enough to do this on its own with low-cardinality fields, and also to prune partitions on date fields, but you can always try different granularities (month, week, day) and see how the performance changes. Something like this:
-- build one CTE for each value of the key
-- (key_values is assumed to be defined already; see the note below)
with
{% for key in key_values %}
cte_{{ key }} as (
    select ...                  -- the columns you need
    from t1
    join t2 on ...              -- join t1, t2, ... as you need
    where t1.f = {{ key }}
      and t2.f = {{ key }}      -- quote {{ key }} if the field is a string
){% if not loop.last %},{% endif %}
{% endfor %}

-- union them all back together (union all is fine: the per-key slices do not overlap)
{% for key in key_values %}
select * from cte_{{ key }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
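To make the loop concrete, key_values could just be hard-coded with a {% set %}, or pulled from the data with the get_column_values macro from the dbt_utils package (t1 and f are the same placeholder names as above, and t1 would need to be a dbt model for ref() to work):

{% set key_values = range(1, 13) | list %}   -- e.g. months 1..12, hard-coded
-- or, if you have the dbt_utils package installed:
{% set key_values = dbt_utils.get_column_values(table=ref('t1'), column='f') %}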