Incremental Load - First Run, Check Table Exists/Contains Rows

I understand that on the first run of an incremental model, all source rows will be added and a table created if it doesn't already exist:

Incremental models are built as tables in your data warehouse. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table which is the table that has already been built.
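
So, as I understand it, the bare-bones shape of such a model would be something along these lines (model name and column list are purely illustrative):

-- models/latest_orders.sql (illustrative name)
{{ config(materialized='incremental') }}

select id, region, timestamp
from {{ ref('orders') }}
-- first run: the table is built from all source rows
-- subsequent runs: only the rows I tell dbt to filter for should be processed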

In my case, the source model contains multiple rows per ID:

ID Region Timestamp
1 X 01/01/2023
1 Z 02/01/2023
1 A 03/01/2023

Within my incremental model I’d like to include a single row per ID that’s associated with the latest timestamp. In the example provided above only the following row would appear in the incremental model:

ID Region Timestamp
1 A 03/01/2023

Each time I run the incremental model I’d like to:

  • Check whether the SQL table generated by the incremental model already exists in the database
  • For robustness, if the table does exist, check that it actually contains rows (i.e. is not empty)

If the table doesn’t exist, or it doesn’t contain any rows, the following script would run:

SELECT a.*
FROM {{ref('orders')}} a
INNER JOIN (SELECT id, MAX(timestamp) AS max_timestamp
            FROM {{ref('orders')}}
            GROUP BY id) AS max_a
        ON a.id = max_a.id
       AND a.timestamp = max_a.max_timestamp

I believe the initial checks above would be a common design pattern, and was wondering if there are any standard design patterns or sets of functions to use?
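
For what it’s worth, the closest I’ve pieced together so far is checking for the relation with dbt’s adapter.get_relation and counting rows with run_query, roughly as below (just a sketch of the idea, I’m not sure it’s idiomatic):

{# check whether the target table already exists #}
{% set target_relation = adapter.get_relation(
       database=this.database,
       schema=this.schema,
       identifier=this.identifier) %}

{% if target_relation is not none and execute %}
  {# check whether the existing table contains any rows #}
  {% set row_count = run_query('select count(*) from ' ~ this).columns[0].values()[0] %}
{% endif %}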

I found the following thread helpful, but would like to ask if others have any alternative suggestions/best practices?

Any pointers would be much appreciated :+1:


dbt incremental models are the same as any other model type, in that you don’t need to (and shouldn’t!) spend any time thinking about whether the table already exists, and only minimal time thinking about whether or not it’s empty.

This manifests as a query that transforms data in the same way whether or not the table exists/contains rows, with the only difference being how much data is processed.

Have you looked into the is_incremental() macro? Basically what you’d do is

with orders as (
  select * from {{ ref('orders') }}
),

windowed as (
  select   
    id, 
    region, 
    timestamp, 
    row_number() over (partition by id order by timestamp desc) as rownum
  from orders

  {% if is_incremental() %}
    -- if the table doesn't exist, then every order row will be processed. 
    -- If it does, then only rows that have changed since the last invocation need to be processed.
    where exists (
      select null   
      from {{ this }} as this
      where this.id = orders.id
      and orders.timestamp > this.timestamp
    )
  {% endif %}
),

final as (
  select 
    id,
    region,
    timestamp
  from windowed
  where rownum = 1
)

select * from final
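
One thing worth adding: for the “one row per ID” behaviour to hold across runs, you’d also configure the model as incremental with a unique_key, so that dbt merges re-processed IDs into the existing table rather than appending duplicate rows. Something like this at the top of the model (assuming id is unique in the target):

{{ config(
    materialized='incremental',
    unique_key='id'
) }}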

The article you linked refers to source tables existing or not, i.e. the raw untransformed tables created by your ETL tool.

Many thanks for this.
My initial SQL query to populate the incremental model wasn’t in a particularly dbt-friendly format.

Your example has been a great help and I can use that as a template in the future. :+1:

