When in the DAG should denormalization of codes into string occur?

We have been provided some source tables that contain multiple coded columns. Those columns contain integers that map to a single string value in a “main dictionary table”.

Using the codes is not human friendly and eventually, the codes are going to be converted into Strings. Also, the strings in the main dictionary table do not change.

My question is: When should this denormalization occur?

On one hand, if we do it early we will be making sure that we only have a “single source of truth” but the transformations are really slow (there is a join per column!) and I’m worried that the performance of our pipeline is going to be severely affected. (We have to do this for a lot of tables)

On the other hand, if we don’t do it, then eventually the analytics teams will either be dealing with the codes (which is annoying) or have to convert them themselves, and that will create a lot of code duplication and readability issues.

Can anyone point me to any resource to learn about this? Or does anyone have any input to provide?

1 Like

Great question! Going by our guide on how we structure our dbt projects, I would encourage you to do this in the staging model.

The guide advises against joins in most staging cases, but has an exceptions section at the bottom which talks about using a “base” model first when joins are necessary to make sense of the data at all.

Your instincts here are spot on. If they’re ultimately going to have to convert this downstream, then any performance issues are just going to happen there instead - in other words, if you have to pay the price regardless, you might as well at least get a nice UX out of it.

Perhaps you will want to materialize this as a table to “bake in” the transformations so that you pay the performance penalty once and then it’s nice and quick and tidy for all downstream consumers.

I think that’s the approach we are going to take. We will make those models that denormalize the tables into our staging layer, with the models materialized as tables. Then, we are going to create a “Slim Deployment” as explained here.

Thank you!

1 Like