Hey dbt Community,
We are working to stage some data where a lot of the columns contain codes that lookup to values either in other tables (primary key/foreign key) or to values that exist in documentation (for these codes, we copy all the code value pairs into a CSV and load them as a seed). Our initial strategy was to bring in the code values via base models in order to translate them directly in the staging models. The problem with this is, if we every wanted to left join to all possible values, we’d have to use the base model outside of a staging model which rubs me the wrong way.
Have any of you developed any best practices or design patterns around code/lookup tables? Right now we’re leaning towards creating an exception to the “no joins in staging models” rule just for these types of situations but I’m curious what you all have done.
Thanks!
Connor
edit: Found this community post after submitting (guess I wasn’t using the correct terms during my initial search). Looks like the suggested method is using base models but that still doesn’t account for the “left joining to all possible values” problem.
@joellabes You still active in these streets?
My advice here is to not follow convention for conventions sake. If your business/data models require (in your opinion) a specific layout that may not be the general best practice, doesn’t mean it isn’t the best practice for you.
We have joins in our src layers (analogous with the staging layer in your scenario) for making sure that all of our primary keys are aligned and correct to unblock redundant joins etc down the line.
Note: @Cooper White originally posted this reply in Slack. It might not have transferred perfectly.
Thanks Cooper. That’s what we’re leaning towards. Have you experienced any downstream inefficiencies with having these joins so early in the DAG? Do you materialize your src layer as Views? Tables? We don’t mind straying from convention but we have less appetite to do so at the cost of performance
Really depends on your volume and the kind of join to be honest.
We have some source models that are just views doing casting and renaming, we have some that are full load tables doing a join or two, and we have some set up as incremental.
Best way to explore is just to give it a go in my opinion
Note: @Cooper White originally posted this reply in Slack. It might not have transferred perfectly.