Last week I was fixing a bug in our internal analytics project, in a model which groups our employees by age range for our DEI reporting. A dashboard started failing because someone had their 25th birthday, meaning we no longer had anyone in the 18-24 age bracket.
To fix it, I stopped using dbt_utils’ get_column_values macro and instead added a new macro to our project which knew how to return both the logic to generate the age buckets and the full list of all possible age buckets, regardless of whether anyone was in that cohort at the time.
The raw data
We start with a table that looks like this:
employee_id | … | age |
---|---|---|
ABC123 | … | 25 |
DEF456 | … | 32 |
HIJ789 | … | 49 |
… | … | … |
XYZ890 | … | 66 |
The initial transformation
This case statement buckets employees:
Sample case code
case
when age >= 18 and age < 25
then '18 to 24'
when age >= 25 and age < 35
then '25 to 34'
when age >= 35 and age < 45
then '35 to 44'
when age >= 45 and age < 55
then '45 to 54'
when age >= 55 and age < 65
then '55 to 64'
when age >= 65
then '65+'
end as age_bucket
Then we were pivoting them, in conjunction with dbt_utils.get_column_values()
to get the possible values:
Sample pivot code
Heads up that you could get most of this for free by using dbt_utils.pivot()
instead, but we had some more complex needs that I’m glossing over.
{% set attribute_cols = ['ethnicity', 'age_bucket', 'etc'] %}
select
{% for attribute_col in attribute_cols %}
{%- set col_values = dbt_utils.get_column_values(
table = ref('int_dei_align_person_attributes'),
column = attribute_col
) -%}
{% set columns_to_summarise = [] %}
{# Iterate through column values to create sum statements #}
{%- for col_value in col_values -%}
{% set col_name = attribute_col ~ '__' ~ dbt_utils.slugify(col_value) %}
{% do columns_to_summarise.append(col_name) %}
sum(
case
when {{ attribute_col }} = '{{ col_value }}'
then 1
else 0
end
) as {{ col_name }},
{%- endfor %}
{{ columns_to_summarise | join(" + ") }} as {{ attribute_col }}__total_responses{% if not loop.last %},{% endif %}
{%- endfor -%}
which results in
age_bucket___25_to_34 | age_bucket___45_to_54 | … | age_bucket___65 |
---|---|---|---|
2 | 1 | … | 1 |
Children get older, I’m getting older too
This model used a lot of dynamic column generation to handle all the pivoting (all up, there are >50 variants across different characteristics), but the column names are hardcoded in Looker. This means that when employee ABC123 above had their 25th birthday:
- there was no longer anyone in the 18-24 group
- so dbt didn’t create a
age_bucket___18_to_24
column - and Looker panicked.
Avoiding divergence
To protect against this, we needed to ensure that all possible buckets were always represented, while also preventing the code from diverging. The best solution would be to populate the col_values
list with the possible values from the case statement above, but we still don’t want to be repeating configuration. Hardcoding them into the model isn’t an option.
Solving the problem with a macro
Macros let us reuse code in dbt. Given a list of age boundaries, it’s trivial to generate 18-24, 25-35, and so on. This would be overkill if all we needed was the case statement, but with an extra flag in the macro signature I can also reuse its output as exactly the list of values I’m after.
Here’s the code:
{% macro generate_adp_age_buckets(return_bracket_names=False) %}
{%- set age_boundaries = [18, 25, 35, 45, 55, 65] %}
{%- set bracket_names = [] %}
case
{%- for age in age_boundaries %}
{%- if not loop.last %}
{%- set next_age = age_boundaries[loop.index0 + 1] %}
{%- set range_name = age ~ ' to ' ~ (next_age - 1) %}
{%- do bracket_names.append(range_name) %}
when age >= {{ age }} and age < {{ next_age }}
then '{{ range_name }}'
{%- else %}
{%- set range_name = age ~ '+' %}
{%- do bracket_names.append(range_name) %}
when age >= {{ age }}
then '{{ range_name }}'
{%- endif %}
{%- endfor %}
end
{%- if return_bracket_names %}
{%- do return(bracket_names) %}
{%- endif %}
{%- endmacro %}
What are the outputs?
Depending on how it’s invoked, you get two outputs:
select
{{ generate_adp_age_buckets() }} as age_buckets
from {{ ref('my_model') }}
renders to the same case statement as originally shown above. Since it doesn’t hit the do return
block, it outputs the templated string.
When the macro does hit the do return
block (caused by the return_bracket_names=True
branch):
{% set col_values = generate_adp_age_buckets(return_bracket_names=True) %}
{{ col_values }}
the macro outputs a simple list:
['18 to 24', '25 to 34', '35 to 44', '45 to 54', '55 to 64', '65+']
which is perfect for inclusion in the slugify macro shown in the pivot example above.
Is this a good idea?
Only just. The docs on macros warn that:
Once you learn the power of Jinja, it’s common to want to abstract every repeated line into a macro! Remember that using Jinja can make your models harder for other users to interpret — we recommend favoring readability when mixing Jinja with SQL, even if it means repeating some lines of SQL in a few places. If all your models are macros, it might be worth re-assessing.
In this case, the heavy lifting in the model is mostly macros, because of how dynamic the columns need to be. Let’s consider the alternatives:
- Manually copy-pasting 50+
sum(case when...)
statements would be uglier and subject to errors that would be harder to track down. - Likewise, putting the boundaries in two different files would rely on a comment saying
/* IF YOU CHANGE THIS FILE, ALSO CHANGE THE OTHER FILE */
which isn’t safe.
Given that, swapping out a piece of code that was a bit too clever with something more explicit is a win IMO, and hits a balance between magic and redundancy. I won’t be using this pattern regularly, but it’s nice to have it on hand if I need it again one day.