I currently have a DAG running in Airflow that consists of 22 steps, and on top of that it uses some sources that are created in the same scope. It is all SQL, so very well suited to bring over to dbt. However, I'm wondering what the best practice is for separating and splitting SQL operations into the different layers, how to handle intermediate steps properly, and how to organize all of this in a proper folder structure.
I want to keep/create clarity without making it too complex by breaking it up into too many pieces.
- Do you always use a separate staging model if you do anything more than a `SELECT *` on a source (e.g. make some field lower case and remove spaces)? See the first sketch after this list.
- When you build a certain collection of data to be used as a source in one process only, would you consider this to be a mart layer and a fact table, or do you keep it in the staging layer (e.g. a combined collection of products that live in different sources, used to map against in downstream steps)?
- When simply combining two intermediate models with a UNION for further use in downstream intermediate models, would you consider this a staging-layer model since it is such a simple operation, or do you 'never go back', making this an intermediate model as well? (Second sketch below.)
- From the documentation it seems to be good practice to have the layers at the top level, with topic-specific subfolders. Do I understand this correctly? I could imagine building the structure the other way around, since you then have everything related to one topic in one place, i.e.
models/staging/topic
models/intermediate/topic
models/marts/topic
versus
models/topic/staging
models/topic/intermediate
models/topic/marts
- How do you handle/name different levels of models that can be considered the 'output' of the data pipeline? I guess these are marts, but in this case I would have one that I currently call `source__*`, as it contains all the data from the previous steps and is sometimes used by myself and a few other people for analyses. Then there is a more commonly used model that applies some filtering and the last interpretation logic on top of it; that table is used more by other people.
- Then, lastly, there will be yet more aggregated tables built on top of the last model of the pipeline; these will be used mostly for dashboarding. This is what might be considered a presentation layer, but I seem to be missing that concept within dbt? (Third sketch below.)
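To make the first question concrete, this is roughly the kind of staging model I mean — a minimal sketch with made-up source and column names, not my actual pipeline:

```sql
-- models/staging/shop/stg_shop__products.sql
-- Barely more than a SELECT *: lower-case one field and strip its spaces.
select
    product_id,
    lower(replace(product_name, ' ', '')) as product_name_clean,
    category,
    updated_at
from {{ source('shop', 'raw_products') }}
```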
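The UNION case from the third question would be something like this (again, hypothetical model names):

```sql
-- models/intermediate/products/int_products__unioned.sql
-- Just stitching two intermediate models together for further use downstream.
select * from {{ ref('int_products__webshop') }}
union all
select * from {{ ref('int_products__store') }}
```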
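And for the last two questions, the chain of 'output' models currently looks roughly like this (simplified, with placeholder names; in dbt each select would of course live in its own model file):

```sql
-- models/marts/orders/source__orders.sql
-- Full output of the pipeline; used by me and a few others for ad-hoc analyses.
select * from {{ ref('int_orders__enriched') }}

-- models/marts/orders/orders_filtered.sql
-- Filtering plus the last interpretation logic; used by most other people.
select *
from {{ ref('source__orders') }}
where is_valid

-- models/marts/orders/orders_daily.sql
-- Aggregated on top of the filtered model, mostly for dashboards ('presentation layer'?).
select
    order_date,
    count(*) as nr_orders
from {{ ref('orders_filtered') }}
group by order_date
```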
My apologies if I'm mixing up different data modelling terms; I'm trying to understand how this would all work in an 'ideal dbt' world, to avoid adding more complexity by mixing different modelling and naming conventions.
This data pipeline is quite complex, and even I (who built it in the first place) find it quite challenging to debug different parts, so I want to improve this while porting it all over to dbt. That's why I appreciate all input on how you handle these kinds of things.