Modeling SQL update statements

I am relatively new to dbt and trying to convert a SQL script on Redshift into a bona fide model. The script is essentially an insert into a lookup table followed by two dozen or so update statements (I know, awful) that bucket rows into categories based on string pattern matches.

I am curious about the best way to translate this pattern into a model. It seems like the initial insert could be materialized as an ephemeral model. If so, how do I then run a series of update statements against that table? I don’t care about persisting, or making queryable, any of the intermediate states between the updates.

create table foobar as
select * 
from blah....;

update foobar
set column_name = boo
where other_column like '%something%';

update foobar
set column_name = boo
where other_column like '%something-else%';

update....

I’m wondering about this as well. Two of my tables need a lot of cleanup, and I have to use dozens of update statements and a few temp tables to get the final set of tables I need.

I am also new to dbt and was considering doing these transformations immediately after importing the raw data into the warehouse, via a call to a stored procedure, and then referencing the source tables in a CTE within dbt to show the lineage between these “staged” tables and their raw data sources. I’m not sure if there is a better pattern.

I think the way to do this is with select queries rather than updates.

select 
    case 
        when other_column like '%something%' then 'boo'
        when other_column like '%something-else%' then 'boo-else'
        else 'no boo for you'
    end as column_name
from foobar

Unfortunately, I believe this solution would require a cleaner dataset. For instance, in my source table I am doing an approximate search on the following fields: first_name, last_name, address_1, address_2, company, and source. (I know, quite the data entry issue, but because this is external data I have little say over it.)

As for processing this data: I search each of these fields for a list of known entity identifiers by performing a like '%something%' match per field, then update the record with the actual entity once approximate string matches are found across the different columns. Any new entity requires an update to the procedure.
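For what it’s worth, here is a minimal Python sketch of the kind of multi-field substring matching described above. The field names and entity list are illustrative assumptions, not the actual schema or identifiers:

```python
# Sketch of multi-field approximate entity matching. The field names and
# the entity/identifier list below are made-up examples, not the real data.
SEARCH_FIELDS = ["first_name", "last_name", "address_1", "address_2",
                 "company", "source"]

# Each known entity maps to one or more identifier substrings to look for,
# mirroring the `like '%something%'` patterns in the stored procedure.
KNOWN_ENTITIES = {
    "Acme Corp": ["acme"],
    "Globex": ["globex"],
}

def match_entity(row):
    """Return the first entity whose identifier appears in any search field."""
    for entity, identifiers in KNOWN_ENTITIES.items():
        for field in SEARCH_FIELDS:
            value = (row.get(field) or "").lower()
            if any(ident in value for ident in identifiers):
                return entity
    return None
```

The upside of keeping the identifier list in a data structure like this (or a seed table in dbt) is that adding a new entity means adding a row, not editing the procedure.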

Thanks @Kimcha! I wish I could translate this into a giant case statement, but sadly the query would be too complex and difficult to read and debug that way. It sounds like there is no good way to achieve this within dbt; I concur with @anthonyae that the dataset passed into a dbt model just needs to be cleaner.

The pattern I am leaning towards right now is loading my staging table into a Python DataFrame, running all the cleaning steps there to categorize rows, and then writing that data back to a table accessible to dbt. I am using an Airflow DAG to orchestrate the initial cleanup step and then call all the downstream processing tasks. The downside is that my transformations are no longer 100% within dbt, but I can live with that.
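As a rough illustration of that cleanup task, here is a stdlib-only Python sketch of the “bucket rows by substring match” step, with rows as plain dicts (the patterns and column names are made up, borrowed from the example upthread):

```python
# Sketch of the out-of-warehouse categorization step. Patterns and column
# names are illustrative. First match wins, so more specific patterns
# must come first -- the opposite of the update chain, where the last
# matching update statement wins.
PATTERNS = [
    ("something-else", "boo-else"),
    ("something", "boo"),
]

def categorize(rows, search_col="other_column", target_col="column_name"):
    """Set target_col on each row from the first matching pattern,
    replicating the chain of update statements in the original script."""
    for row in rows:
        value = (row.get(search_col) or "").lower()
        row[target_col] = next(
            (label for pattern, label in PATTERNS if pattern in value),
            "no boo for you",
        )
    return rows
```

A task like this can run before the dbt step in the DAG and land its output in a table that downstream models `ref` or treat as a source.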

If someone has a more elegant solution to this I am all ears!

👋 Hey there! Welcome to dbt!

Based on the original post, here are a couple of assumptions I’m making, followed by an example approach.

Assumptions:

  • blah is available to dbt as a source
  • We’re only interested in two columns (column_name, other_column)

With those assumptions, I’m thinking some Jinja is all you’d need to clean up the model.

Example:

This model, stg_foobar, basically replaces your create table by getting everything from blah loaded into foobar. You can schedule dbt to run this model as often as you need to make sure foobar has all the data needed from blah, assuming there is something in the omitted code that results in foobar not just being a copy of blah.

stg_foobar.sql

select * from {{ source('blah_container', 'blah_table') }}

I imagine this next model, categorized_foobar, as a view, but it could be a table. The point here is that you don’t need to go back and constantly update any tables. Rather, you create this model to build on top of stg_foobar so that anytime data is read from categorized_foobar, it’s automatically using the evaluated columns.

The Jinja mappings variable lets you define your mapping rules cleanly in a single place and saves you from writing out a lengthy case statement by hand.

categorized_foobar.sql

{% set mappings = {'something-else': 'boo-else', 'something': 'boo'} %}

with source as (

    select * from {{ ref('stg_foobar') }}

),

final as (

    select
        case
            {% for old, new in mappings.items() %}
            -- case evaluates top to bottom, so list more specific
            -- patterns (e.g. 'something-else') before broader ones
            when other_column like '%{{ old }}%' then '{{ new }}'
            {% endfor %}
        end as column_name
    from source

)

select * from final

Of course, if the complexity of the string matching is the crux of the problem here (that doesn’t stand out to me in the original question), we might look at another approach.

P.S. I’m drafting this up without access to run/test it, so you may run into some syntax issues, but I think conceptually it should work.