I’m not sure if this is the appropriate place to ask; please feel free to direct me somewhere else!
The problem I’m having
I have been struggling to convert a categorical column (containing a varying number of comma separated strings) into binary columns for use in classification algorhythms.
The context of why I’m trying to do this
This column has roughly 5000 unique values and I may be doing this to other columns as well, so I am seeking a dynamic approach
What I’ve already tried
Chat GPT and I have tried a number of various macros and this is the iteration I’m currently on:
{% macro binarize_column(column, prefix) %}
WITH split_data AS (
SELECT
LOWER({{ column }}) AS {{ column }},
REGEXP_SPLIT_TO_TABLE(LOWER({{ column }}), ',') AS split_column
FROM {{ ref('my_table') }}
),
unique_data AS (
SELECT DISTINCT split_column
FROM split_data
),
binned_data AS (
SELECT
{{ column }},
{% for row in unique_data %}
CASE
WHEN {{ row.split_column }} IS NOT NULL THEN 1
ELSE 0
END AS {{ prefix }} || '_' || {{ row.split_column }}
{% if not loop.last %},{% endif %}
{% endfor %}
FROM split_data
)
SELECT *
FROM binned_data;
{% endmacro %}
and calling it via:
{{ config(materialized='table') }}
{{ binarize_column('my_column', 'my_prefix') }}
This is the resulting error message when trying to dbt run
22:34:35.609155 [error] [MainThread]: Database Error in model test (models/intermediate/my_model.sql)
22:34:35.609412 [error] [MainThread]: syntax error at or near "FROM"
22:34:35.609717 [error] [MainThread]: LINE 29: FROM split_data