Is there a design pattern similar to inferred members for handling late-arriving dimensions when loading facts?

Late-arriving master data/reference data is a common scenario when loading facts. For example, we receive a transaction carrying a store number (master data) or a currency code (reference data) that has not yet arrived in the dimension load from the same source.

The data engineering team at the current site has taken the approach of looking up the surrogate key during the transactional load and, when the lookup fails, substituting the surrogate and natural key of the default unknown member record in the corresponding dimension. They have not developed any technique for logging the failed lookups and re-processing the fact record, so the reference to the unknown record is never replaced with the correct dimension record once it arrives from the true source.
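For concreteness, here is a minimal sketch of that fallback approach, assuming a `dim_store` dimension with a reserved unknown member whose surrogate key is -1; all table and column names are illustrative:

```sql
-- Illustrative fact load: a failed dimension lookup falls back to the
-- reserved "unknown" member (surrogate key -1 in this sketch).
select
    t.transaction_id,
    coalesce(s.store_key, -1) as store_key,  -- -1 = unknown member row in dim_store
    t.transaction_amount
from staging.transactions t
left join dim.dim_store s
    on s.store_number = t.store_number       -- natural/business key lookup
```

The weakness is exactly as described above: once the fact row is written with -1, nothing goes back and repoints it when the real store row lands.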

A popular technique in traditional ETL tools such as SSIS, DataStage, and Talend was to create an inferred member row on detecting a failed dimension lookup, by calling a stored procedure and passing it the natural/business key of the reference/master data code. These tools had more or less out-of-the-box support for trapping the failed lookup, creating the inferred member record, and using the surrogate key returned by the stored procedure call.
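In a SQL-first stack the same idea is usually expressed set-based rather than row-by-row. A sketch of an inferred-member insert that would run before the fact load, again with hypothetical names:

```sql
-- Set-based equivalent of the stored-procedure pattern: before loading
-- facts, insert a placeholder row for any store number seen in the
-- transactions but missing from the dimension. Names are illustrative.
insert into dim.dim_store (store_number, store_name, is_inferred)
select distinct
    t.store_number,
    'Inferred member' as store_name,  -- placeholder attributes
    true              as is_inferred  -- flag so the next dimension load
                                      -- can overwrite this row in place
from staging.transactions t
left join dim.dim_store s
    on s.store_number = t.store_number
where s.store_number is null;
```

The `is_inferred` flag is the key design choice: it lets the regular dimension load treat the placeholder as an update target rather than a conflict, so the fact rows keep their surrogate key and simply pick up the real attributes.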

Is there a best practice/recommended design pattern for handling late-arriving dimensions when loading facts?

I’m using dbt_utils.generate_surrogate_key() to manage my dimension keys. I pass my natural/business key into it, but I’m running into an issue with my design where late-arriving dimensions are not accounted for. My current thought is to create a lookup table right after everything is staged and drive everything off of that. This lookup would contain the natural/business key(s) from the dimension but also from all facts. By including the natural/business keys from every fact, effectively nothing will be “late arriving”. That lookup table would have a surrogate key generated (you could just as well use an auto-incrementing integer if you wanted).
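A sketch of that lookup as a dbt model, assuming hypothetical staging models `stg_customers` and `stg_transactions` sharing a `customer_id` business key:

```sql
-- models/key_lookup_customer.sql (hypothetical model name)
-- Union the business keys from the dimension source and every fact
-- source, so a key that arrives first on a fact still gets a surrogate
-- key. Plain UNION deduplicates the combined key set.
with all_keys as (
    select customer_id from {{ ref('stg_customers') }}
    union
    select customer_id from {{ ref('stg_transactions') }}
)

select
    {{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_key,
    customer_id
from all_keys
```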

The problem with this is that you now force your facts to be built after this lookup, so you lose the parallelization you would have if you simply called dbt_utils.generate_surrogate_key() in both your fact and dimension models. In addition, building the lookup potentially processes many, many records. Say you have hundreds of transactions (facts) per day per customer (dimension): now you are scanning all of your transactions for the distinct set of customer IDs just to make sure each one is in the lookup and has a surrogate key. Not great, especially if you aren’t using incremental materialization.
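If the lookup has to exist, an incremental materialization at least bounds that scan to keys not yet seen. A sketch of the same hypothetical model as an incremental variant:

```sql
-- models/key_lookup_customer.sql, incremental variant (hypothetical).
-- Only business keys missing from the existing lookup are appended.
{{ config(materialized='incremental', unique_key='customer_id') }}

with all_keys as (
    select customer_id from {{ ref('stg_customers') }}
    union
    select customer_id from {{ ref('stg_transactions') }}
)

select
    {{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_key,
    customer_id
from all_keys
{% if is_incremental() %}
-- skip keys already present in the lookup built by previous runs
where customer_id not in (select customer_id from {{ this }})
{% endif %}
```

This cuts the per-run work, but it does not remove the ordering dependency: the facts still have to wait for the lookup to finish before they can resolve their keys.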