Redshift: converting sparse columns into a single json (super) column

as promised, here are the full details from this dbt slack thread … hopefully useful. i had to cut all but the link above because i'm only allowed 2 links per post

it’s a lot and i’m probably exposing some of my bad dbt/coding practices. hoping some peeps can kindly chime in with improvements/modifications so i can update for future dbt’ers who stumble upon this thread!

Background

  • Current state:
    • Table that has ~20 “reserved” columns (they are always non-null) …
    • … and 50+ sparse columns (almost always null)
  • Goal:
    • “JSON-ify” the 50+ sparse columns into a single column
  • Solution:
    • Use a combination of (a) jinja in our model and (b) a macro (json_parse/SUPER) to make that transformation

Solution — Overview

  • (1) identify the reserved columns
  • (2) write some jinja to get all source columns + set our reserved columns
  • (3) macro that converts the sparse columns into a single (beautifully concise) column
  • (4) use that macro in our model!
  • (5) test
  • (6) query and delight
  • note: there’s probably a better and cleaner way to do (2) through (4)

Solution — Details

  • (1) identify the reserved columns:

    • pretty much brute force here, but this is what we did
      • grabbed all the columns from our source table
      • put them in a google sheet and marked them as is_reserved (Y/N)
  • (2) write some jinja to get all table columns + set our reserved columns

    • code (inside the model that is pulling from the reserved + sparse columns):
{%- set source_relation = adapter.get_relation(
  database="xxx",
  schema="yyy",
  identifier="zzz") -%}
{%- set columns = adapter.get_columns_in_relation(source_relation) -%}
{%- set reserved_columns = [
  "column_a"
  , "column_b"
  , ...
  , "column_n"
] %}
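as a sanity check on what the jinja above is doing, here's a plain-Python analogy (not dbt, and the column names are made up for illustration): everything not in `reserved_columns` is treated as sparse and will get folded into the JSON column later.

```python
# Plain-Python analogy of the jinja in step (2), not dbt code.
# Column names here are hypothetical.
all_columns = ["column_a", "column_b", "sparse_1", "sparse_2"]
reserved_columns = ["column_a", "column_b"]

# everything not reserved is "sparse" and gets folded into the JSON column
sparse_columns = [c for c in all_columns if c not in reserved_columns]
```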
  • (3) macro that converts the sparse columns into a single (beautifully concise) column
    • note on shortcomings:
      • added the condition column.is_string() because we had some json columns that caused bracket
        mismatches
      • had to hard-code the last key-val pair in our json ("_success": 1) because otherwise we risked a trailing comma (trailing commas make redshift puke)
      • didn’t get the spacing for compiled code right
    • code (new file in our /macros directory):
{%- macro convert_to_json(all_columns, reserved_columns) -%}
json_parse(
    '{'
{% for column in all_columns %}
  {%- if column.name not in reserved_columns and column.is_string() -%}
    || nvl({{"'" ~ column.quoted ~ "'" }} || ':"' || {{ column.name }}::varchar || '", ', '')
  {%- endif -%}
{% endfor %}
    || {{ "'" ~ '"_success": 1' ~ "'" }}
    ||
    '}') as our_json_beauty
{%- endmacro -%}
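to see why the hard-coded `"_success": 1` pair sidesteps the trailing comma, here's a rough Python simulation of the string the macro's compiled SQL builds (row/column names are hypothetical, and it ignores the is_string() nuance):

```python
import json

def jsonify_row(row, reserved):
    """Rough Python simulation (not dbt) of the string the macro's SQL builds."""
    parts = []
    for name, value in row.items():
        # mirror the macro: skip reserved columns and nulls (nvl -> '')
        if name in reserved or value is None:
            continue
        parts.append('"%s":"%s", ' % (name, value))
    # the hard-coded final pair means the last real pair's trailing
    # comma is always followed by another key, so the JSON stays valid
    return "{" + "".join(parts) + '"_success": 1' + "}"

row = {"column_a": "keep-me-out", "sparse_1": "x", "sparse_2": None}
doc = jsonify_row(row, reserved=["column_a"])
parsed = json.loads(doc)  # parses cleanly: no trailing comma
```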
  • (4) use that macro in our model!
select
  column_a
  , column_b
  , ...
  , column_n
  -- we set both of these args in step (2); our macro is the one from step (3)
  , {{ convert_to_json(columns, reserved_columns) }}
from foo
  • note: if you want, you could add selecting the reserved columns (a through n) to your macro. i skipped that to keep the macro focused on the sparse columns.

  • (5) test

    • admittedly, i’m a total noob here, but these are the basic checks i ran:
      • check your compiled code to make sure the macro spit out what you expected it to
      • make sure we didn’t add any rows
      • run some spot checks to confirm your sparse columns got json-ified
        • select our_json_beauty from foo where sparse_column is not null limit 10
  • (6) query and delight

select
  our_json_beauty.sparse_column as sparse_column
from foo
where our_json_beauty.sparse_column = 'bar'
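the SUPER dot-navigation above is roughly JSON key access; for intuition, a quick Python analogy (the document here is hypothetical):

```python
import json

# hypothetical document like the one our_json_beauty holds
our_json_beauty = json.loads('{"sparse_column": "bar", "_success": 1}')

# our_json_beauty.sparse_column in Redshift is roughly dict access in Python
value = our_json_beauty["sparse_column"]
```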

big ups if you made it through all the above! hope this was helpful



hi! amazing writeup thank you! as I mentioned in slack, to solve the trailing comma issue, you could use a replace to pre-parse the generated JSON.

JSON_PARSE(
    REPLACE(
        '{'
            || ... || NVL('"sparse_column_N":"' || sparse_col_N || '", ', '')
            ||
            '}',
        ', }',
        '')
    )

this is perfect! updated slightly just so the third arg provides a closing }:

JSON_PARSE(
    REPLACE(
        '{'
            || ... || NVL('"sparse_column_N":"' || sparse_col_N || '", ', '')
            ||
            '}',
        ', }',
        '}') --replace with a closing curly brace
    )
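to convince yourself the `', }'` → `'}'` replacement yields valid JSON, here's the same transformation in Python (the sample string is hypothetical):

```python
import json

# what the concatenation produces when every pair carries a trailing ", "
raw = '{"sparse_1":"x", "sparse_2":"y", }'

# the REPLACE(..., ', }', '}') trick: swap the dangling ", }" for a bare "}"
# caveat: this rewrites every ", }" occurrence, so a value that literally
# contains ", }" would get mangled too
fixed = raw.replace(', }', '}')
parsed = json.loads(fixed)  # parses cleanly: no trailing comma
```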