Using macros to handle concatenation of possibly-null fields


#1

Note: The DB in use here is BigQuery

I did a dumb thing 10 months ago. I was converting a python ETL script into SQL, and one of the steps builds a nested address record. Into this nested record goes the address from the doctor record, and related records from the address table, where the address in the address table does not match the address in the doctor table.

I handled this with a doctor_addresses model file which filters the doctor+address dataset to only those where at least one of the address component fields didn’t match the corresponding field on the doctor record. Here’s what I did:

select
  d.dr_id
  ,d.address
  ,d.address2
  ,d.city
  ,d.state
  ,d.zipcode
  ,d.country
  ,d.latitude
  ,d.longitude
from {{ref('doctor')}} d
union distinct
select
  d.dr_id
  ,a.line1 as address
  ,a.line2 as address2
  ,a.city
  ,a.state_region as state
  ,a.postal_code as zipcode
  ,a.country_code as country
  ,a.latitude
  ,a.longitude
from {{ref('doctor')}} d
inner join {{ref('address')}} a
  on  d.dr_id = a.dr_id
where concat(d.address, d.address2, d.city, d.state, d.zipcode, d.country, d.latitude, d.longitude)
  != concat(a.line1, a.line2, a.city, a.state_region, a.postal_code, a.country_code, a.latitude, a.longitude)

The dumb thing was not accounting for the possibility that one or more of the address component fields might be null. When any argument is null, the concat() function returns null, and of course a != between null and anything (including null) also returns null, which the where clause treats as false. So whenever any of the compared columns in either table was null, the second query of the union did nothing.
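To make the failure mode concrete outside of SQL, here's a small Python analogy (an illustration of BigQuery's semantics, not actual SQL) where None plays the role of SQL null:

```python
# Python analogy of BigQuery's null semantics (illustrative only):
# concat() returns null if ANY argument is null, and a comparison
# involving null is "unknown", which a WHERE clause treats as false.

def sql_concat(*args):
    """Return None if any argument is None, mimicking BigQuery concat()."""
    if any(a is None for a in args):
        return None
    return "".join(str(a) for a in args)

def sql_neq(left, right):
    """Mimic SQL's three-valued !=: None (unknown) if either side is null."""
    if left is None or right is None:
        return None  # "unknown" in SQL; WHERE filters these rows out
    return left != right

doctor_side = sql_concat("123 Main St", None, "Springfield")   # null address2
address_side = sql_concat("456 Oak Ave", "Suite 2", "Springfield")

# The two addresses genuinely differ, but the predicate is unknown,
# not true, so the row is silently dropped:
print(sql_neq(doctor_side, address_side))
```

Running this prints `None` rather than `True`: exactly the silent row-dropping described above.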

How embarrassing…

To resolve this, I need to cast the values as strings, then replace nulls with empty strings, then concatenate. That where clause would look like this:

select ...
where concat(
  ifnull(safe_cast(d.address as string), ''),
  ifnull(safe_cast(d.address2 as string), ''),
  ifnull(safe_cast(d.city as string), ''),
  ifnull(safe_cast(d.state as string), ''),
  ifnull(safe_cast(d.zipcode as string), ''),
  ifnull(safe_cast(d.country as string), ''),
  ifnull(safe_cast(d.latitude as string), ''),
  ifnull(safe_cast(d.longitude as string), '')
  ) != concat(
  ifnull(safe_cast(a.line1 as string), ''),
  ifnull(safe_cast(a.line2 as string), ''),
  ifnull(safe_cast(a.city as string), ''),
  ifnull(safe_cast(a.state_region as string), ''),
  ifnull(safe_cast(a.postal_code as string), ''),
  ifnull(safe_cast(a.country_code as string), ''),
  ifnull(safe_cast(a.latitude as string), ''),
  ifnull(safe_cast(a.longitude as string), '')
  )

Yuck. It’s ugly to look at, and there are lots of opportunities for misplaced commas, quotes, and parens.

The group_by() macro in the dbt macro documentation, however, gave me the idea to build those concat()s dynamically: a macro that iterates through a list of passed field names and generates the concat() statement on the fly.

{% macro safe_concat(field_list) %}
  {# Takes an input list and generates a concat() statement with each argument in the list safe_casted to a string and wrapped in an ifnull() #}
  concat({% for f in field_list %}
    ifnull(safe_cast({{ f }} as string), '')
    {% if not loop.last %}, {% endif %}
  {% endfor %})
{% endmacro %}
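If you want to preview what the macro compiles to without running dbt, you can render it with plain jinja2 (dbt uses Jinja under the hood; this sketch assumes the jinja2 package is installed and uses a shortened two-field list for brevity):

```python
from jinja2 import Template

# The safe_concat macro from above, rendered with plain jinja2 to preview
# the SQL that dbt would compile. The trailing expression calls the macro.
macro_src = """
{% macro safe_concat(field_list) %}
  concat({% for f in field_list %}
    ifnull(safe_cast({{ f }} as string), '')
    {% if not loop.last %}, {% endif %}
  {% endfor %})
{% endmacro %}
{{ safe_concat(['d.address', 'd.city']) }}
"""

compiled = Template(macro_src).render()
print(compiled)
```

The output is one ifnull(safe_cast(... as string), '') per field, comma-separated and wrapped in a single concat(), whitespace and all.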

By using the macro, the where clause becomes quite approachable again:

select ...
where {{ safe_concat(['d.address','d.address2','d.city','d.state','d.zipcode','d.country','d.latitude','d.longitude']) }}
  != {{ safe_concat(['a.line1','a.line2','a.city','a.state_region','a.postal_code','a.country_code','a.latitude','a.longitude']) }}

The compiled SQL has a bunch of extraneous line breaks, as is typical with macro-generated code, but it’s a small price to pay for the readability of both the model file and the macro code.


#2

Love this. As I’ve personally gotten a lot more comfortable with / in the habit of thinking in Jinja, I’ve started to see more and more opportunities like this arise. It’s actually quite common to want to do the same thing to an entire list of fields, and it’s much more concise / DRY to input them as a list and then loop through all of the elements in that list. I’ve done this with coalesces in the way that you show above, but also very frequently with last_value() window functions when I want to dedup some dataset.
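For anyone curious what the last_value() version of this pattern might look like, here is a hedged sketch (the macro name `latest_values` and its arguments are invented for illustration, not taken from any post above), again rendered with plain jinja2 to show the compiled SQL:

```python
from jinja2 import Template

# Hypothetical dedup helper in the same spirit as safe_concat: loop over a
# field list and emit a last_value(... ignore nulls) window expression for
# each one, picking the latest non-null value per entity.
macro_src = """
{%- macro latest_values(field_list, partition_col, order_col) -%}
{%- for f in field_list %}
  last_value({{ f }} ignore nulls) over (
    partition by {{ partition_col }}
    order by {{ order_col }}
    rows between unbounded preceding and unbounded following
  ) as {{ f }}{% if not loop.last %},{% endif %}
{%- endfor %}
{%- endmacro -%}
{{ latest_values(['email', 'phone'], 'user_id', 'updated_at') }}
"""

compiled = Template(macro_src).render()
print(compiled)
```

Each field in the list becomes its own window expression, so adding a field to the dedup is a one-token change in the model file.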

The longer I work with Jinja + SQL, the more Jinja I find that I write :smiley: Glad to see you’re finding the same…


#3

As we continue to adapt to the dbt way of thinking, one of the things that I and the analysts who use my DW struggle to wrap our heads around is how to have one source of truth for a lot of the logic we use in our transformations. User bucketing is one example. Sure, we could have a model with a user column and a column that contains the bucketing logic, but that's not intuitive for us yet, and having a bunch of models like that running around feels clunky. But if they're macros, it's very intuitive, because we can think about them just like we'd think about a function in SQL or Python instead of thinking about them like a dataset.
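As a concrete sketch of that bucketing idea (the macro name `user_bucket`, the thresholds, and the column names are all invented for illustration), the logic lives in one macro and every model calls it like a function:

```python
from jinja2 import Template

# Invented example of centralizing user-bucketing logic in a macro, so
# every model reuses the same case expression instead of copy-pasting it.
macro_src = """
{%- macro user_bucket(order_count_col) -%}
case
  when {{ order_count_col }} >= 10 then 'power'
  when {{ order_count_col }} >= 1 then 'active'
  else 'prospect'
end
{%- endmacro -%}
select user_id, {{ user_bucket('lifetime_orders') }} as bucket from users
"""

compiled = Template(macro_src).render()
print(compiled)
```

Change the thresholds in one place and every model that calls the macro picks up the new definition on the next run.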

It starts to break down the mental wall that SQL code isn’t software, which is a great development for the space.