Experts - new guy question here, so I apologize in advance for any eye rolls I cause.
I’ve been digging into dbt best practices, and all the examples are overly simplified - so I’m wondering if this is the correct way to implement my situation. My transformations are VERY involved, multi-step processes, and I DON’T want to cram them into thousands of lines of SQL with many CTEs. Per dbt’s approach, I would like to break them up into multiple models, each accomplishing a single-ish goal. My specific question: the examples I’m seeing are so simple that I can’t tell whether the approach below is correct. What do you think?
For the below, assume a layer/tier of your choice - stage, refined, or mart. Which tier this happens in is a different question; I’m scoping this to how you construct multi-model pipelines in dbt.
Phase 0:
- pipeline_name_0_variables.sql: create a table to hold global variables you want to pass to each subsequent model. This is because dbt is stateless and models connected with ref() cannot actually share variables. An example variable would be something like ETL_INSTANCE_ID. The output table is stg_pipeline_name_0_variables.
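Here’s roughly what I have in mind for phase 0 - just a sketch. The column names are placeholders, and using dbt’s built-in invocation_id to populate something like ETL_INSTANCE_ID is my own assumption:

```sql
-- models/<tier>/pipeline_name/pipeline_name_0_variables.sql
-- One-row table of run-level values that every downstream model in the
-- pipeline can join against. Column names are placeholders.
{{ config(materialized='table', alias='stg_pipeline_name_0_variables') }}

select
    '{{ invocation_id }}'           as etl_instance_id,   -- dbt's built-in run identifier
    '{{ run_started_at }}'          as run_started_at,
    '{{ var("batch_id", "n/a") }}'  as batch_id            -- optional --vars override
```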
Phase 1:
- pipeline_name_1_src: this phase consumes source tables from an upstream tier (either curated or modeled) and prepares them for the pipeline. We test freshness of the source tables, handle null values, rename fields, and do the first data prep - for example joining header and item tables into a single table. We output our data to the stage table stg_pipeline_name_1_src. This is an incremental load, so the data volume is small, and the stage table is overwritten each run. Note: we read stg_pipeline_name_0_variables and can write to it to store needed metadata.
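For phase 1, I’m picturing something like this (freshness itself would be declared in the sources .yml rather than in the model; the header/item table and column names are just placeholders):

```sql
-- models/<tier>/pipeline_name/pipeline_name_1_src.sql
-- Rename/clean the upstream tables and join header + item into one stage table.
-- Source and column names are illustrative only.
{{ config(materialized='table', alias='stg_pipeline_name_1_src') }}

with vars as (
    select * from {{ ref('pipeline_name_0_variables') }}
),

header as (
    select
        order_id,
        coalesce(customer_id, '-1') as customer_id,   -- null handling
        order_date
    from {{ source('curated', 'sales_order_header') }}
),

item as (
    select
        order_id,
        line_number,
        coalesce(qty, 0) as quantity                  -- rename + null handling
    from {{ source('curated', 'sales_order_item') }}
)

select
    h.order_id,
    h.customer_id,
    h.order_date,
    i.line_number,
    i.quantity,
    v.etl_instance_id
from header h
join item i
    on h.order_id = i.order_id
cross join vars v
```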
Phase 2.0 - Phase 2.X:
- assume the same info as above, except we are implementing the complicated transformations. Depending on the complexity, these might be split across separate models (hence 2.X). The point is to make them more understandable by breaking them up. Each model reads the stage 0 table for global variables (pre-hook), reads the previous stage’s table (stg_pipeline_name_1_src, or the prior 2.X stage), and writes to its own stage table (stg_pipeline_name_2_X).
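For a 2.X model I’m thinking something like the sketch below. One thing I’m not sure about: a pre-hook can run SQL before the model builds (e.g. setting a Snowflake query tag), but as far as I can tell it can’t hand values back to the model, so here I read the variables table with ref() instead. All names are placeholders.

```sql
-- models/<tier>/pipeline_name/pipeline_name_2_0_transform.sql
-- One step of the complicated business logic, kept small enough to reason about.
{{ config(
    materialized = 'table',
    alias        = 'stg_pipeline_name_2_0',
    pre_hook     = "alter session set query_tag = 'pipeline_name_phase_2_0'"
) }}

with vars as (
    select * from {{ ref('pipeline_name_0_variables') }}
),

prev as (
    select * from {{ ref('pipeline_name_1_src') }}   -- or the prior 2.X model
),

transformed as (
    -- the actual transformation logic for this step goes here
    select
        p.*,
        v.etl_instance_id
    from prev p
    cross join vars v
)

select * from transformed
```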
Phase 3:
- dimension lookups / handle late-arriving dims / add descriptions for all codes. Reads from the *_0_variables table and the stage 2 table (previous pipeline table). Writes to the stage 3 table.
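In my head, phase 3 is a left join to the dimensions with a fallback to an “unknown member” key for late-arriving rows (dim_customer and the key/column names are made up):

```sql
-- models/<tier>/pipeline_name/pipeline_name_3_dims.sql
-- Dimension lookups with an unknown-member fallback for late-arriving dims.
{{ config(materialized='table', alias='stg_pipeline_name_3') }}

with base as (
    select * from {{ ref('pipeline_name_2_0_transform') }}   -- last 2.X model
),

dim_customer as (
    select customer_id, customer_key, customer_name
    from {{ ref('dim_customer') }}
)

select
    b.*,
    coalesce(d.customer_key, -1)         as customer_key,    -- late-arriving dim -> unknown member
    coalesce(d.customer_name, 'Unknown') as customer_name    -- description for the code
from base b
left join dim_customer d
    on b.customer_id = d.customer_id
```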
Phase 4:
- load the target. Uses a pre-hook for variables from the stage 0 table, and ref() to link to the previous stage table (stage 3).
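And the final load as an incremental model - the unique_key and the date filter are just one way to do it, and the column names are placeholders (merge is the default incremental strategy on dbt-Snowflake):

```sql
-- models/<tier>/pipeline_name/pipeline_name_4_target.sql
-- Final incremental load into the target table.
{{ config(
    materialized = 'incremental',
    unique_key   = ['order_id', 'line_number']
) }}

select
    s.*,
    current_timestamp() as loaded_at
from {{ ref('pipeline_name_3_dims') }} s

{% if is_incremental() %}
  -- only pick up rows newer than what's already in the target (simplified filter)
  where s.order_date > (select max(order_date) from {{ this }})
{% endif %}
```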
The idea is that I would create a “pipeline” named after the target table, with a folder per pipeline and a model per phase/stage. So in this example, 5+ models all chained together.
Again, all the internet examples are so simplified that they don’t show this. Am I doing this wrong - or does this make sense? Assume a large Fortune 500 company / data environment.
Thanks in advance - and sorry for the length…
NOTE: I also have upstream tiers like raw, curated, modeled, intermediary, reference, and various marts. When appropriate I will move reusable data/transformations upstream. For this question I am focused on a pipeline, in whatever tier, that requires multiple models due to its complexity.
The problem I’m having
I can’t find an example - on the internet, in books, on YouTube, or in best-practices documentation - that shows a pipeline complicated enough to need more than one model.
The context of why I’m trying to do this
I’m trying to create patterns/templates for my organization, and I want to break up very complicated transformations/pipelines into manageable steps/phases, where each model accomplishes a small number of objectives but the overall pipeline achieves a very complicated data transformation.
What I’ve already tried
Books, documentation, YouTube, best-practices guides. All use overly simplistic examples; they don’t chain multiple models together to form a single pipeline. They say to move transformations upstream, which I’m doing - but that isn’t my question. Regardless of what “layer” you are working in, you might still have a very complicated transformation that you want to break up into multiple models (not dozens of CTEs). So at some point, regardless of where you are in your warehouse architecture, you will want to chain models together to form a single “pipeline” (I think).
Again - I don’t know the answer or the best practice here; I’m just taking what I did with other technologies / ETL situations and trying to apply it to the dbt development methodology. I’m asking the experts for guidance. Thanks again for anything you can add.
Edit: My mindset is that I’m trying to convert my legacy ETL loads to the dbt-Snowflake methodology. So, if it helps you understand where this “staged” approach comes from, it’s the legacy ETL patterns we implemented with SSIS and SQL stored procedures.