Redshift: converting sparse columns into a single json (super) column

as promised, here are the full details from this dbt slack thread … hopefully useful. i had to cut all but the link above because i'm only allowed 2 links per post

it’s a lot and i’m probably exposing some of my bad dbt/coding practices. hoping some peeps can kindly chime in with improvements/modifications so i can update for future dbt’ers who stumble upon this thread!

Background

  • Current state:
    • Table that has ~20 “reserved” columns (they are always non-null) …
    • … and 50+ sparse columns (almost always null)
  • Goal:
    • “JSON-ify” the 50+ sparse columns into a single column
  • Solution:
    • Use a combination of (a) jinja in our model and (b) a macro (json_parse/SUPER) to make that transformation

Solution — Overview

  • (1) identify the reserved columns
  • (2) write some jinja to get all source columns + set our reserved columns
  • (3) macro that converts the sparse columns into a single (beautifully concise) column
  • (4) use that macro in our model!
  • (5) test
  • (6) query and delight
  • note: there’s probably a better and cleaner way to do (2) through (4)

Solution — Details

  • (1) identify the reserved columns:

    • pretty much brute force here, but this is what we did
      • grabbed all the columns from our source table
      • put them in a google sheet and marked them as is_reserved (Y/N)
  • (2) write some jinja to get all table columns + set our reserved columns

    • code (inside the model that is pulling from the reserved + sparse columns):
{%- set source_relation = adapter.get_relation(
  database="xxx",
  schema="yyy",
  identifier="zzz") -%}
{%- set columns = adapter.get_columns_in_relation(source_relation) -%}
{%- set reserved_columns = [
  "column_a"
  , "column_b"
  , ...
  , "column_n"
] %}
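as a sanity check on what the jinja above is doing, here's a plain-Python analogy (not dbt, and the column names are made up for illustration): everything not in `reserved_columns` is treated as sparse and will get folded into the JSON column later.

```python
# Plain-Python analogy of the jinja in step (2), not dbt code.
# Column names here are hypothetical.
all_columns = ["column_a", "column_b", "sparse_1", "sparse_2"]
reserved_columns = ["column_a", "column_b"]

# everything not reserved is "sparse" and gets folded into the JSON column
sparse_columns = [c for c in all_columns if c not in reserved_columns]
```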
  • (3) macro that converts the sparse columns into a single (beautifully concise) column
    • note on shortcomings:
      • added the condition column.is_string() because we had some json columns that caused bracket
        mismatches
      • had to hard-code the last key-val pair in our json ("_success": 1) because otherwise we risked a trailing comma (trailing commas make redshift puke)
      • didn’t get the spacing for compiled code right
    • code (new file in our /macros directory):
{%- macro convert_to_json(all_columns, reserved_columns) -%}
json_parse(
    '{'
{% for column in all_columns %}
  {%- if column.name not in reserved_columns and column.is_string() -%}
    || nvl({{"'" ~ column.quoted ~ "'" }} || ':"' || {{ column.name }}::varchar || '", ', '')
  {%- endif -%}
{% endfor %}
    || {{ "'" ~ '"_success": 1' ~ "'" }}
    ||
    '}') as our_json_beauty
{%- endmacro -%}
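to see why the hard-coded `"_success": 1` pair sidesteps the trailing comma, here's a rough Python simulation of the string the macro's compiled SQL builds (row/column names are hypothetical, and it ignores the is_string() nuance):

```python
import json

def jsonify_row(row, reserved):
    """Rough Python simulation (not dbt) of the string the macro's SQL builds."""
    parts = []
    for name, value in row.items():
        # mirror the macro: skip reserved columns and nulls (nvl -> '')
        if name in reserved or value is None:
            continue
        parts.append('"%s":"%s", ' % (name, value))
    # the hard-coded final pair means the last real pair's trailing
    # comma is always followed by another key, so the JSON stays valid
    return "{" + "".join(parts) + '"_success": 1' + "}"

row = {"column_a": "keep-me-out", "sparse_1": "x", "sparse_2": None}
doc = jsonify_row(row, reserved=["column_a"])
parsed = json.loads(doc)  # parses cleanly: no trailing comma
```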
  • (4) use that macro in our model!
select
  column_a
  , column_b
  , ...
  , column_n
  -- we set both of these args in step (2); our macro is the one from step (3)
  , {{ convert_to_json(columns, reserved_columns) }}
from foo
  • note: if you want, you could add selecting the reserved columns (a through n) to your macro. i skipped that to keep the macro focused on the sparse columns.

  • (5) test

    • admittedly, i’m a total noob here, but these are the basic checks i ran:
      • check your compiled code to make sure the macro spit out what you expected it to
      • make sure we didn’t add any rows
      • run some spot checks to confirm your sparse columns got json-ified
        • select our_json_beauty from foo where sparse_column is not null limit 10
  • (6) query and delight

select
  our_json_beauty.sparse_column as sparse_column
from foo
where our_json_beauty.sparse_column = 'bar'
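the SUPER dot-navigation above is roughly JSON key access; for intuition, a quick Python analogy (the document here is hypothetical):

```python
import json

# hypothetical document like the one our_json_beauty holds
our_json_beauty = json.loads('{"sparse_column": "bar", "_success": 1}')

# our_json_beauty.sparse_column in Redshift is roughly dict access in Python
value = our_json_beauty["sparse_column"]
```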

big ups if you made it through all the above! hope this was helpful



hi! amazing writeup thank you! as I mentioned in slack, to solve the trailing comma issue, you could use a replace to pre-parse the generated JSON.

JSON_PARSE(
    REPLACE(
        '{'
            || ... || NVL('"sparse_column_N":"' || sparse_col_N || '", ', '')
            ||
            '}',
        ', }',
        '')
    )

this is perfect! updated slightly just so the third arg provides a closing }:

JSON_PARSE(
    REPLACE(
        '{'
            || ... || NVL('"sparse_column_N":"' || sparse_col_N || '", ', '')
            ||
            '}',
        ', }',
        '}') --replace with a closing curly brace
    )
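to convince yourself the `', }'` → `'}'` replacement yields valid JSON, here's the same transformation in Python (the sample string is hypothetical):

```python
import json

# what the concatenation produces when every pair carries a trailing ", "
raw = '{"sparse_1":"x", "sparse_2":"y", }'

# the REPLACE(..., ', }', '}') trick: swap the dangling ", }" for a bare "}"
# caveat: this rewrites every ", }" occurrence, so a value that literally
# contains ", }" would get mangled too
fixed = raw.replace(', }', '}')
parsed = json.loads(fixed)  # parses cleanly: no trailing comma
```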