How to conditionally return either a Jinja list or SQL from a dbt macro

Last week I was fixing a bug in our internal analytics project, in a model which groups our employees by age range for our DEI reporting. A dashboard started failing because someone had their 25th birthday, meaning we no longer had anyone in the 18-24 age bracket.

To fix it, I stopped using dbt_utils’ get_column_values macro and instead added a new macro to our project which knew how to return both the logic to generate the age buckets and the full list of all possible age buckets, regardless of whether anyone was in that cohort at the time.

The raw data

We start with a table that looks like this:

employee_id age
ABC123 25
DEF456 32
HIJ789 49
XYZ890 66

The initial transformation

This case statement buckets employees:

Sample case code
    when age >= 18 and age < 25 
        then '18 to 24'
    when age >= 25 and age < 35 
        then '25 to 34'
    when age >= 35 and age < 45 
        then '35 to 44'
    when age >= 45 and age < 55 
        then '45 to 54'
    when age >= 55 and age < 65 
        then '55 to 64'
    when age >= 65
        then '65+'
end as age_bucket

Then we were pivoting them, in conjunction with dbt_utils.get_column_values() to get the possible values:

Sample pivot code

Heads up that you could get most of this for free by using dbt_utils.pivot() instead, but we had some more complex needs that I’m glossing over.

{% set attribute_cols = ['ethnicity', 'age_bucket', 'etc'] %}


{% for attribute_col in attribute_cols %}
    {%- set col_values = dbt_utils.get_column_values(
        table = ref('int_dei_align_person_attributes'),
        column = attribute_col
    ) -%}
    {% set columns_to_summarise = [] %}

    {# Iterate through column values to create sum statements #}
    {%- for col_value in col_values -%}
      {% set col_name = attribute_col ~ '__' ~ dbt_utils.slugify(col_value) %}
      {% do columns_to_summarise.append(col_name) %}
            when {{ attribute_col }} = '{{ col_value }}'
              then 1
            else 0
        ) as {{ col_name }},
    {%- endfor %}

    {{ columns_to_summarise | join(" + ") }} as {{ attribute_col }}__total_responses{% if not loop.last %},{% endif %}
{%- endfor -%}

which results in

age_bucket___25_to_34 age_bucket___45_to_54 age_bucket___65
2 1 1

Children get older, I’m getting older too

This model used a lot of dynamic column generation to handle all the pivoting (all up, there are >50 variants across different characteristics), but the column names are hardcoded in Looker. This means that when employee ABC123 above had their 25th birthday:

  • there was no longer anyone in the 18-24 group
  • so dbt didn’t create a age_bucket___18_to_24 column
  • and Looker panicked.

Avoiding divergence

To protect against this, we needed to ensure that all possible buckets were always represented, while also preventing the code from diverging. The best solution would be to populate the col_values list with the possible values from the case statement above, but we still don’t want to be repeating configuration. Hardcoding them into the model isn’t an option.

Solving the problem with a macro

Macros let us reuse code in dbt. Given a list of age boundaries, it’s trivial to generate 18-24, 25-35, and so on. This would be overkill if all we needed was the case statement, but with an extra flag in the macro signature I can also reuse its output as exactly the list of values I’m after.

Here’s the code:

{% macro generate_adp_age_buckets(return_bracket_names=False) %}

{%- set age_boundaries = [18, 25, 35, 45, 55, 65] %}
{%- set bracket_names = [] %}
{%- for age in age_boundaries %}
    {%- if not loop.last %}
        {%- set next_age = age_boundaries[loop.index0 + 1] %}
        {%- set range_name = age ~ ' to ' ~ (next_age - 1) %}
        {%- do bracket_names.append(range_name) %}
        when age >= {{ age }} and age < {{ next_age }} 
            then '{{ range_name }}'
    {%- else %}
        {%- set range_name = age ~ '+' %}
        {%- do bracket_names.append(range_name) %}
        when age >= {{ age }}
            then '{{ range_name }}'
    {%- endif %}
{%- endfor %}

{%- if return_bracket_names %}
    {%- do return(bracket_names) %}
{%- endif %}

{%- endmacro %}

What are the outputs?

Depending on how it’s invoked, you get two outputs:

  {{ generate_adp_age_buckets() }} as age_buckets
from {{ ref('my_model') }}

renders to the same case statement as originally shown above. Since it doesn’t hit the do return block, it outputs the templated string.

When the macro does hit the do return block (caused by the return_bracket_names=True branch):

{% set col_values = generate_adp_age_buckets(return_bracket_names=True) %}
{{ col_values }}

the macro outputs a simple list:

['18 to 24', '25 to 34', '35 to 44', '45 to 54', '55 to 64', '65+']

which is perfect for inclusion in the slugify macro shown in the pivot example above.

Is this a good idea?

Only just. The docs on macros warn that:

Once you learn the power of Jinja, it’s common to want to abstract every repeated line into a macro! Remember that using Jinja can make your models harder for other users to interpret — we recommend favoring readability when mixing Jinja with SQL, even if it means repeating some lines of SQL in a few places. If all your models are macros, it might be worth re-assessing.

In this case, the heavy lifting in the model is mostly macros, because of how dynamic the columns need to be. Let’s consider the alternatives:

  • Manually copy-pasting 50+ sum(case when...) statements would be uglier and subject to errors that would be harder to track down.
  • Likewise, putting the boundaries in two different files would rely on a comment saying /* IF YOU CHANGE THIS FILE, ALSO CHANGE THE OTHER FILE */ which isn’t safe.

Given that, swapping out a piece of code that was a bit too clever with something more explicit is a win IMO, and hits a balance between magic and redundancy. I won’t be using this pattern regularly, but it’s nice to have it on hand if I need it again one day.