Traditionally in Inmon-style data warehouse design, there is an intermediate layer between the raw “data lake” and mart (dimensional) layers. This is either accomplished with data vault (difficult), or a normalized design (easy).
Then marts are built on top of these. The idea is that is if you design your marts wrong (stovepipe designs), you have nothing to fall back on and must re-do them. Additionally, and more importantly, the intermediate layer accurately reflects you business processes.
With dbt, so far, most of what I am seeing is source (data lake) ==>stage (light transform)==>mart (fully transformed) data designs.
Why is this? I.E. Is there something about dbt that somehow eliminates the need for a middle layer? I am new to dbt, but the concept is still the same.
At Trove, we’ve very much built an Inmon-style data warehouse. dbt works very well for this.
To roughly characterize our model, there are core tables in the business-entity layer: a normalized representation of the business and related data entities, just as you suggested. Then there are summaries, aggregations, snapshots used in data marts. When using visualization tools like Looker, it’s more efficient to use the data marts. The core model is often the target of exploratory data analysis.
The core models are rebuilt in one dbt run each night. The marts are built in a separate run executed later. dbt would equally support building them all in a single run: the built-in dependency resolution would mean that the core models would be built first and the marts after.
Throughout the business day, the core models are updated incrementally every 10 minutes with fresh source data. Depending on how they’re used, the data marts may or may not be incremental models.
You have communicated that perfectly. I was worried that everyone in the dbt world was skipping that business layer. Do you have any thoughts on using Data Vault as a business layer? I was at a large company where they had a Data Vault business layer. For me it seemed overly complex, but I am no data vault expert.
I have implemented DV at a large organization, and you are right it can be insanely complex, and would potentially be over engineering for a large number of scenarios.
But I will say this - it’s a high appropriate solution that pays massive dividends if :
- If your organisation has many disparate and changing source data sets, but at the same time a reasonably stable , and logical, business entity model.
- You have a DW automation tool to automatically build out assets and pipelines based on the vault model (like BIMLFlex)
Other lesser reasons besides. Also building data marts from a good data vault is trivial, and more often that not can be entirely virtualized in views.
Doing it in DBT I don’t know - but probably wouldn’t try it without DW automation.