as promised! the full details from this dbt slack thread … hopefully useful. had to cut all but the link above because i am only allowed 2 links for my post
it’s a lot and i’m probably exposing some of my bad dbt/coding practices. hoping some peeps can kindly chime in with improvements/modifications so i can update for future dbt’ers who stumble upon this thread!
Background
- Current state:
- Table that has ~20 “reserved” columns (they are always non-null) …
- … and 50+ sparse columns (almost always null)
- Goal:
- “JSON-ify” the 50+ sparse columns into a single column
- Solution:
- Use a combination of (a) jinja in our model and (b) a macro (json_parse/SUPER) to make that transformation
Solution — Overview
- (1) identify the reserved columns
- (2) write some jinja to get all source columns + set our reserved columns
- (3) macro that converts the sparse columns into a single (beautifully concise) column
- (4) use that macro in our model!
- (5) test
- (6) query and delight
- note: there’s probably a better and cleaner way to do (2) through (4)
Solution — Details
- (1) identify the reserved columns:
  - pretty much brute force here, but this is what we did
  - grabbed all the columns from our source table
  - put them in a google sheet and marked them is_reserved (Y/N)
- (2) write some jinja to get all table columns + set our reserved columns
- code (inside the model that is pulling from the reserved + sparse columns):
{%- set source_relation = adapter.get_relation(
database="xxx",
schema="yyy",
identifier="zzz") -%}
{%- set columns = adapter.get_columns_in_relation(source_relation) -%}
{%- set reserved_columns = [
    "column_a"
    , "column_b"
    , ...
    , "column_n"
] -%}
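aside, since this tripped me up at first: each item in columns is a dbt Column object, and (as far as i understand the api) the attributes the macro relies on behave roughly like this:

```jinja
{# each item from get_columns_in_relation is a dbt Column object #}
{% for column in columns %}
    {{ column.name }}        {# bare name, e.g. column_a #}
    {{ column.quoted }}      {# quoted identifier, e.g. "column_a" #}
    {{ column.is_string() }} {# true for varchar/char/text-ish types #}
{% endfor %}
```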
- (3) macro that converts the sparse columns into a single (beautifully concise) column
- note on shortcomings:
  - added the condition column.is_string() because we had some json columns that caused bracket mismatches
  - had to hard-code the last key-val pair in our json ("_success":1) because otherwise we risked a trailing comma (they make redshift puke)
  - didn't get the spacing for compiled code right
- code (new file in our /macros directory):
{%- macro convert_to_json(all_columns, reserved_columns) -%}
json_parse(
'{'
{% for column in all_columns %}
{%- if column.name not in reserved_columns and column.is_string() -%}
|| nvl({{"'" ~ column.quoted ~ "'" }} || ':"' || {{ column.name }}::varchar || '", ', '')
{%- endif -%}
{% endfor %}
|| {{ "'" ~ '"_success": 1' ~ "'" }}
|| '}'
) as our_json_beauty
{%- endmacro -%}
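for anyone curious what this compiles to: with two made-up sparse columns (sparse_x, sparse_y), the compiled sql should look roughly like this (modulo the spacing issue mentioned above):

```sql
json_parse(
    '{'
    || nvl('"sparse_x"' || ':"' || sparse_x::varchar || '", ', '')
    || nvl('"sparse_y"' || ':"' || sparse_y::varchar || '", ', '')
    || '"_success": 1'
    || '}'
) as our_json_beauty
```

the nvl() means a null sparse column contributes nothing, and the hard-coded "_success": 1 absorbs what would otherwise be a trailing comma. (one more shortcoming worth flagging: if a string value itself contains a double quote, this will produce invalid json — escaping is left as an exercise.)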
- (4) use that macro in our model!
select
column_a
, column_b
, ...
, column_n
-- we set both of these args in step (2); our macro is the one from step (3)
, {{ convert_to_json(columns, reserved_columns) }}
from foo
- note: if you want, you could add selecting the reserved columns (a through n) to your macro. i didn't because i didn't want that.
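if you did want that, a minimal (hypothetical — this macro isn't part of our actual solution) sketch could be:

```jinja
{%- macro select_reserved(reserved_columns) -%}
    {%- for column in reserved_columns -%}
        {{ column }},
    {% endfor -%}
{%- endmacro -%}
```

then the model body becomes select {{ select_reserved(reserved_columns) }} {{ convert_to_json(columns, reserved_columns) }} from foo — the trailing comma after the last reserved column is fine because the json column always follows it.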
- (5) test
  - admittedly, i'm a total noob here, but these are the basic checks i ran:
    - check your compiled code to make sure the macro spit out what you expected it to
    - make sure we didn't add any rows
    - run some spot checks to confirm your sparse columns got json-ified (note you have to navigate into the SUPER column now, since sparse_column no longer exists on its own):
      select our_json_beauty from table where our_json_beauty.sparse_column is not null limit 10
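for the "didn't add any rows" check, a quick sketch (table names are placeholders — yyy.zzz is the source from step (2), our_new_model is wherever your model landed):

```sql
-- expect these two counts to match
select
    (select count(*) from yyy.zzz) as source_rows
    , (select count(*) from our_new_model) as model_rows
```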
- (6) query and delight (redshift won't let you reuse a select alias in the where clause, so repeat the SUPER path there):
  select
      our_json_beauty.sparse_column as sparse_column
  from foo
  where our_json_beauty.sparse_column = 'bar'
big ups if you made it through all the above! hope this was helpful