Modeling SQL update statements

I am relatively new to dbt and trying to convert a SQL script on Redshift into a bona fide model. The script is essentially an insert into a lookup table followed by two dozen or so update statements (I know, awful) that bucket rows into categories based on string pattern matches.

I am curious about the best way to translate this pattern into a model. It seems like the initial insert could be materialized as an ephemeral model. If so, how do I then run a series of update statements against that table? I don’t care about persisting, or making queryable, any of the intermediate states between the updates.

create table foobar as
select * 
from blah....;

update foobar
set column_name = boo
where other_column like '%something%';

update foobar
set column_name = boo
where other_column like '%something-else%';

update....

I’m wondering about this as well. Two of my tables need a lot of cleanup, and I have to use dozens of update statements and a few temp tables to get the final set of tables I need.

I am also new to dbt and was considering doing these transformations immediately after importing the raw data into the warehouse, via a call to a stored procedure, and then referencing the source tables in a CTE within dbt to show the lineage between these “staged” tables and their raw data sources. I’m not sure if there is a better pattern.

I think the way to do this is with select queries rather than updates.

select 
    case 
        when other_column like '%something%' then 'boo'
        when other_column like '%something-else%' then 'boo-else'
        else 'no boo for you'
    end as column_name
from foobar

Unfortunately, I believe this solution would require a cleaner dataset. For instance, in my source table I am doing an approximate search on the following fields: first_name, last_name, address_1, address_2, company, and source. (I know, quite the data entry issue, but because this is external data I have little say over it.)

As for processing this data: I search each of these fields for a list of known entity identifiers by performing a like '%something%' match per field, then update the record with the actual entity once approximate string matches are found across the different columns. Any new entity requires an update to the procedure.
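For what it’s worth, here is a minimal Python sketch of the kind of multi-field substring matching described above. The field names and entity list are illustrative assumptions, not the actual schema or identifiers:

```python
# Sketch of multi-field approximate entity matching. The field names and
# the entity/identifier list below are made-up examples, not the real data.
SEARCH_FIELDS = ["first_name", "last_name", "address_1", "address_2",
                 "company", "source"]

# Each known entity maps to one or more identifier substrings to look for,
# mirroring the `like '%something%'` patterns in the stored procedure.
KNOWN_ENTITIES = {
    "Acme Corp": ["acme"],
    "Globex": ["globex"],
}

def match_entity(row):
    """Return the first entity whose identifier appears in any search field."""
    for entity, identifiers in KNOWN_ENTITIES.items():
        for field in SEARCH_FIELDS:
            value = (row.get(field) or "").lower()
            if any(ident in value for ident in identifiers):
                return entity
    return None
```

The upside of keeping the identifier list in a data structure like this (or a seed table in dbt) is that adding a new entity means adding a row, not editing the procedure.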

Thanks @Kimcha! I wish I could translate this into a giant case statement, but sadly the query would be too complex and difficult to read and debug that way. It sounds like there is no good way to achieve this within dbt; I concur with @anthonyae that the dataset passed into a dbt model just needs to be cleaner.

The pattern I am leaning towards right now is loading my staging table into a Python DataFrame, running all the cleaning steps there to categorize rows, and then writing that data back to a table accessible to dbt. I am using an Airflow DAG to orchestrate the initial cleanup step and then call all the downstream processing tasks. The downside is that my transformations are no longer 100% within dbt, but I can live with that.
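As a rough illustration of that cleanup task, here is a stdlib-only Python sketch of the “bucket rows by substring match” step, with rows as plain dicts (the patterns and column names are made up, borrowed from the example upthread):

```python
# Sketch of the out-of-warehouse categorization step. Patterns and column
# names are illustrative. First match wins, so more specific patterns
# must come first -- the opposite of the update chain, where the last
# matching update statement wins.
PATTERNS = [
    ("something-else", "boo-else"),
    ("something", "boo"),
]

def categorize(rows, search_col="other_column", target_col="column_name"):
    """Set target_col on each row from the first matching pattern,
    replicating the chain of update statements in the original script."""
    for row in rows:
        value = (row.get(search_col) or "").lower()
        row[target_col] = next(
            (label for pattern, label in PATTERNS if pattern in value),
            "no boo for you",
        )
    return rows
```

A task like this can run before the dbt step in the DAG and land its output in a table that downstream models `ref` or treat as a source.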

If someone has a more elegant solution to this I am all ears!

👋 Hey there! Welcome to dbt!

Based on the original post, here are a couple of assumptions I’m making, followed by an example approach.

Assumptions:

  • blah is available to dbt as a source
  • We’re only interested in two columns (column_name, other_column)

With those assumptions, I’m thinking some Jinja is all you’d need to clean up the model.

Example:

This model, stg_foobar, basically replaces your create table by getting everything from blah loaded into foobar. You can schedule dbt to run this model as often as you need to make sure foobar has all the data needed from blah, assuming there is something in the omitted code that results in foobar not just being a copy of blah.

stg_foobar.sql

select * from {{ source('blah_container', 'blah_table') }}

I imagine this next model, categorized_foobar, as a view, but it could be a table. The point here is that you don’t need to go back and constantly update any tables. Rather, you create this model to build on top of stg_foobar so that anytime data is read from categorized_foobar, it’s automatically using the evaluated columns.

The Jinja mappings variable lets you define your mapping rules cleanly in a single place and saves you from writing out a lengthy case statement by hand.

categorized_foobar.sql

{% set mappings = {'something-else': 'boo-else', 'something': 'boo'} %}

with source as (

    select * from {{ ref('stg_foobar') }}

),

final as (

    select
        case
            {% for old, new in mappings.items() %}
            -- case evaluates top to bottom, so list more specific
            -- patterns (e.g. 'something-else') before broader ones
            when other_column like '%{{ old }}%' then '{{ new }}'
            {% endfor %}
        end as column_name
    from source

)

select * from final

Of course, if the complexity of the string matching is the crux of the problem here (that doesn’t stand out to me in the original question), we might look at another approach.

P.S. I’m drafting this up without access to run/test it, so you may run into some syntax issues, but I think conceptually it should work.