How to do 'OUTER APPLY STRING_SPLIT()' in dbt?

Hello friendly people!

I am migrating MSSQL (on premise) code to an Azure stack that includes dbt. dbt is fantastic, but I am not sure how I can translate the following code to something that dbt accepts:

,CTE_ToBeMapped AS (
	SELECT sa.*
		  ,LTRIM(RTRIM(VALUE)) AS AnswerValue_StringSplit
	FROM {{ ref('database__full__table') }} sa
	INNER JOIN CTE_MappingExportIdToBeMapped m
		ON sa.QuestionKeyCoalesce = m.QuestionKeyOldMC
	OUTER APPLY STRING_SPLIT(AnswerValue, '|')
)

Before this CTE, each QuestionKey can hold multiple AnswerValues, such as ‘Answer1|Answer2|Answer3’. This CTE is supposed to generate one row for each question-answer pair, so that the above example (one question with three answers) results in three rows (one question with one answer on each row).

For more context, the full query up to this point. I translated everything to something understandable in English, but the CHARINDEX line might be confusing; the QuestionKeys are normally in format ‘QUESTION’, but if they are multiplechoice questions from one specific system, they would be ‘QUESTION_1’, ‘QUESTION_2’, et cetera, with each holding only one answer in its AnswerValue. The other system, however, would have ‘QUESTION’ with all the possible answers in its AnswerValue. I need to translate the ‘QUESTION’ ‘Answer1|Answer2’ variant to the ‘QUESTION_1’ ‘Answer1’, ‘QUESTION_2’, ‘Answer2’ variant.

--The mapper, filtered on relevant multiplechoice rows
WITH CTE_MappingExportId AS (
	SELECT DISTINCT ExportId AS QuestionKey6
		  ,LEFT(ExportId, LEN(ExportId) - CHARINDEX('_',REVERSE(ExportId))) AS QuestionKeyOldMC
	      ,QuestionText AS QuestionText2
		  ,LTRIM(RTRIM(AnswerValues)) AS AnswerValue3
	FROM {{ ref('database__seed__mapper') }}
	WHERE (Type = 'Multiplechoice' OR Type = 'Multiple answers possible')
)

--Shortlist of old mapping codes of the multiple choice codes (i.e. the code without _1, _2, et cetera)
,CTE_MappingExportIdToBeMapped AS (
	SELECT DISTINCT QuestionKeyOldMC
	FROM CTE_MappingExportId
)

--The rows that can be mapped
--We have a set with only multiplechoice codes from the mapper
--And from this set we can safely use string split to turn every answer into a single row
,CTE_ToBeMapped AS (
	SELECT sa.*
		  ,LTRIM(RTRIM(VALUE)) AS AnswerValue_StringSplit
	FROM {{ ref('database__full__table') }} sa
	INNER JOIN CTE_MappingExportIdToBeMapped m
		ON sa.QuestionKeyCoalesce = m.QuestionKeyOldMC
	OUTER APPLY STRING_SPLIT(AnswerValue, '|')
)

I am not entirely sure what dbt cannot handle. I suspect it is the STRING_SPLIT, but possibly also the OUTER APPLY. I could in theory work around the OUTER APPLY (although it would make the code very messy), but I do not see how I can work around STRING_SPLIT. I tried playing around with macros, but, my knowledge of dbt is just too limited to give meaningful examples of what didn’t work here. I found only one similar example in this forum, but that is for a static ‘split left and split right’ function, while I could in theory need to generate ten new lines because I might have ten multiplechoice answers.

I hope someone can help me translate the above code to something dbt accepts - thank you very much in advance! :slight_smile:

This doesn’t sound like a dbt question, sounds more like an Azure question for what the equivalent would be in your specific Azure database.

Note: @Jack Nguyen originally posted this reply in Slack. It might not have transferred perfectly.

1 Like

That is a super intelligent suggestion! Unfortunately, when I perform similar code in my Azure SQL database, it works perfectly fine. How could that be?

EDIT: No, you are actually right, if I perform this code on a Notebook in Databricks, it gives the same error. So how do I update/upgrade/whatever my Databricks to a higher compatibility level (or, whatever I need to do)…

EDIT2: I solved it - if people are interested in the solution, this is the code I now use in dbt/Databricks:

,CTE_ToBeMapped AS (
    SELECT sa.*
          ,explode(split(Answer, '\\|')) AS Answer_StringSplit
    FROM {{ ref('database__full__table') }} sa
    INNER JOIN CTE_MappingExportIdToBeMapped m
        ON sa.QuestionKey = m.QuestionKeyOldMC
)

Hi,

i am facing a similar issue. I have a table with unique id as one column and another column with comma separated values ex- a,b,c
table i have
col1 col2
1 a,b,c
2 a,b

what i am trying to get as result
col1 col
1 a
1 b
1 c
2 a
2 b

your solution is not working for me, getting error ((42601): SQL compilation error: Unknown function EXPLODE)
i tried cross apply string_split(col2, ’ , ') and unnest but no luck. Can you please help me if you know how to do this? Thanks in advance.