How do you deal with missing dimensions for foreign keys?


What’s the best practice in DBT for dealing with missing dimensions for foreign keys?

For example, I have a fact_orders that contains a foreign key to a dim_funnel. The rows in dim_funnel have to be maintained manually, so it can frequently happen that orders happen on a funnel, but the dimension row for the funnel hasn’t been added yet.

In a traditional DWH model we would include a special row like “Missing Funnel” with a -1 key for example. And then in the fact table query we would check if the correct foreign key exists in the dimension table and if not set it it to -1.

A similar situation can happen when dimension records are deleted, but the facts referencing it are kept.

I guess when generating the fact table, I could do a join on the dimension to check if the key is there, but this prevents one of primary benefits of using hash keys-- being able to load dimensions and facts in parallel.

So I am just wondering how you guys are handling such a situation.

Thanks, Kim

We still create a “-1” “Unknown” row in our dimensions. Although it’s not actually “-1” literally, it’s the hash of “-1”. And then we have a variable in our dbt_project.yml file like this:

unknown_key: "CAST(FARM_FINGERPRINT('-1') AS INT64)"

Then when we build our fact tables and need to populate the keys to join to the dimension tables, we do something like this:

COALESCE(pers.person_key, {{ var('unknown_key') }} ) AS person_key,

I never really thought of one of the primary benefits of hash keys as being able to load dimensions and facts in parallel per se.

1 Like

Fantastic, that makes a lot of sense. Thank you for sharing.

And in this example pers is the joined dimension table, right?

And how do you “append” your -1 dimension row in the dimension?

Do you just add a union at the beginning of the dim table?

Yes, in this case pers is the joined dimension table. And yes we just do a UNION with the query that is producing the dimension to append this row.

1 Like

Hi @Kimcha, there is an interesting article from Roleant Vos on the subject (in the data vault realm) : unknown-keys-zero-keys-or-ghost-keys-in-hubs-for-dv2-0, with a taxonomy on missing values. In my case, we use two distinct missing key values : one when the key is missing but mandatory (problem in the loading or in source data), and another when the key is missing but optional (loading and data ok but data is missing for ‘business’ reason).