This is a companion discussion topic for the original entry at Building a Kimball dimensional model with dbt | dbt Developer Blog
Hi folks!
A great article, thank you for compiling it all together!
Was confused by one moment : seems you suggest to create a visual ER diagram as only part of documentation, Part 6.
Typically I start with visual ER diagram , and even use it to generate tables DDLs and even dbt code(If I work with dbt).
Any reason you put it only at the end? Will not data engineers benefit from having it in from of them even before they started writing any dbt code, dbt tests etc ? Thoughts ?
Hey there,
Yes, it’s better practice to start with an ER Diagram first as a visual communication tool to stakeholders on what you’re going to create and quickly get their confirmation.
Usually, once the development process starts, the schema will change anyway.
At the end of the development process, it’s good to revisit the initial diagram and make any tweaks/updates to it to reflect the final ERD.
Cheers
Thank you, @jonathanneo, for your invaluable blog! It has been incredibly helpful during our transition from a traditional handcrafted DWH to dbt.
I have a question concerning the handling of missing records in dimensional tables. How would you suggest dealing with this situation?
Here are some related resources I found on the topic:
Design Tip #43: Dealing With Nulls In The Dimensional Model
How do you deal with missing dimensions for foreign keys? - In-Depth Discussions - dbt Community Forum (getdbt.com))
Hey @Koen, thanks for your question. The way I would handle records in the fact table that don’t match to a record in the dimension table is through the method that @josh suggested in his post here.
So, let’s say our dimension table is dim_user
, and we have the following columns:
customer_id
customer_key
(the surrogate key, created by usinghash(customer_id)
customer_name
In the dbt model used to generate dim_user
, I would add a row (using union all
) for the following record:
select
-1 as customer_id,
hash(-1) as customer_key,
'Unknown Customer' as customer_name
Then from the fact table (e.g. fact_sales
), I will try to generate a surrogate key that matches dim_user
.
select
sale_id,
price,
quantity,
hash(coalesce(customer_id, -1)) as customer_key -- if customer_id is null, -1 will be used instead.
from
{{ ref('staging_sales') }}
This way, the fact table will reference a customer with the customer name Unknown Customer
instead of null
which might cause confusion.
Hi @jonathanneo , thanks for the article. Finally someone talking about dimensional modelling in a dbt blog!
My question is the following though: you create a surrogate key for all models, even when they have a single natural key e.g.
{{ dbt_utils.generate_surrogate_key(['stg_product.productid']) }} as product_key,
Isn’t this unnecessary since you already have the productid
here? Or is it for consistency purposes? I have had this discussion with other AE’s where they feel that creating a surrogate when there is already a natural key is unnecessary.
Curious to hear your reasoning.
Kind regards,
Lasse
Hello @lassebenni
It’s recommended to use surrogate keys in your models to have control over your unique primary key. It prevents the DWH from being affected by operational changes we don’t control, such as a productid
being reused in the future. That’s among other advantages, like being able to support dimension change tracking.
But in this case, since the macro generate_surrogate_key
generates a hashed productid
, I don’t see how it’s different from directly using the natural key. Kimball recommends using an automatically incremented 4-byte integer but I don’t think that’s possible in dbt.
I think the best practice would be to hash your natural key and a timestamp field, and use that as a surrogate key.
Thanks for this; it’s answered a number of questions I had about the use of surrogate keys.
I’m currently working with a legacy three-tier EDW (Teradata) where we master our surrogate keys from the atomic tables in our raw tier into reference tables where they can then be reused across our platform. Is the above approach of generating the primary and foreign surrogate keys in the presentation tier sustainable across multiple marts e.g. if you had a generic dimensional model plus multiple domain-specific data models?
I’m new to dbt and am working out options of how to apply it In our current space, so I’m keen to understand what is good practice.
Good question, I’m not sure. If we’re going by Kimball, he recommends assigning the surrogate keys at the end of the transformation process, just before loading the dimension tables into the DWH. The key map table that matches natural keys with the surrogate keys is in the staging part of the ETL system. This key map table is also used during fact table processing later on to make sure referential integrity is maintained.
Here’s two diagrams showing how surrogate keys are managed, taken from The Data Warehouse Toolkit. But the book is 10 years old by now, so I don’t know how relevant that’s still is.