Creating a "model version" column in your models using md5

I was faced with an interesting question recently, which I will paraphrase as:

“Is there a way for me to have a column in my views/tables that changes every time the model is modified without me having to remember to update a version number in the code?”

This could be potentially useful if you wanted to be able to understand if the version of a table being queried by a dashboard is the same as previous versions AND could be used in something like a datagroup sql_trigger in Looker to trigger a refresh when a dependent table is updated as a result of a model change.

Using the model Jinja function in dbt the answer is YES but with some caveats:

  • raw_sql is pre-Jinja rendering/compilation – so even if the Jinja renders something different, the raw_sql won’t change. That also means a change to the definition of a macro your calls won’t be reflected even though different SQL might be generated.
  • Relatively minor changes that result in the SAME sql will result in a model version change. So if you adjust the whitespace in your model file
{% set model_sql = model %}
with

dbt_model_version as (
  SELECT MD5('{{ model_sql.raw_sql | e }}') as _dbt_model_version
),

customers as (
  select * from {{ source('jaffle_shop','customers') }}
),

my_model as (
    select 
        id as customer_id,
        first_name,
        last_name,
        first_name || ' ' || last_name as full_name
    from customers
)

select * from my_model
CROSS JOIN dbt_model_version

A few additional notes:

  • I wrote and tested this code on Snowflake. It should work on other dialects, but I would encourage you to test and report back.
  • Wait…CROSS JOIN…really? My original version inserted the MD5('{{ my_sql.raw_sql | e }}') as _dbt_model_version expression into the my_model part of the query. After some questions from colleagues, I elected not to do that as:
    • the performance characteristics were effectively identical and,
    • it makes this much more copy/paste friendly as you can add code to the very beginning and very end of your models without really modifying your model code itself vs. having to get the MD5 expression inserted into exactly the right place.

Questions? Feedback? Ideas for derivative works? Join in the discussion by commenting below!

5 Likes

Neat! I could see applications of this for auditing dbt snapshots of models so you could track how the data might have changed in relation to the code that generated it in addition to just the source data stream. This would be extra useful in this case if you could tie those hashes to a git commit sha, or maybe just another table that stored a version of the sql text along with the md5 hash.