I understand that on the first run of an incremental model, all source rows will be added and a table created if it doesn’t exist:
Incremental models are built as tables in your data warehouse. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table, which is the table that has already been built.
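For reference, that behaviour comes from the incremental materialization plus a filter inside the model itself. A minimal sketch (the column used for filtering is just illustrative) looks something like this:

{{ config(materialized = 'incremental') }}

select *
from {{ ref('orders') }}

{% if is_incremental() %}
-- only applied on later runs, once the target table already exists
where timestamp > (select max(timestamp) from {{ this }})
{% endif %}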
In my case, the source model contains multiple rows per ID:
| ID | Region | Timestamp  |
|----|--------|------------|
| 1  | X      | 01/01/2023 |
| 1  | Z      | 02/01/2023 |
| 1  | A      | 03/01/2023 |
Within my incremental model I’d like to include a single row per ID that’s associated with the latest timestamp. In the example provided above only the following row would appear in the incremental model:
| ID | Region | Timestamp  |
|----|--------|------------|
| 1  | A      | 03/01/2023 |
Each time I run the incremental model I’d like to:
1. Check whether the SQL table generated by the incremental model already exists in the database.
2. For robustness, check whether the table, if it does exist, actually contains any rows.
3. If the table doesn’t exist, or it contains no rows, run the following script:
SELECT a.*
FROM {{ ref('orders') }} a
INNER JOIN (
    SELECT id, MAX(timestamp) AS max_timestamp
    FROM {{ ref('orders') }}
    GROUP BY id
) AS max_a
    ON a.id = max_a.id
    AND a.timestamp = max_a.max_timestamp
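(As an aside, if your warehouse supports QUALIFY, e.g. Snowflake or BigQuery, the same latest-row-per-ID logic can be sketched more compactly:)

SELECT *
FROM {{ ref('orders') }}
-- keep only the most recent row per id
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) = 1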
I believe the initial checks above would be a common design pattern, and I was wondering whether there are any standard design patterns or functions to use for this?
I found the following thread helpful, but I’d like to ask whether others have any alternative suggestions or best practices.
dbt incremental models are the same as any other model type, in that you don’t need to (and shouldn’t!) spend any time thinking about whether the table already exists, and only minimal time thinking about whether the table is empty or not.
This manifests as a query that transforms data in the same way whether or not the table exists/contains rows, with the only difference being how much data is processed.
Have you looked into the is_incremental() macro? Basically what you’d do is
with orders as (

    select * from {{ ref('orders') }}

),

windowed as (

    select
        id,
        region,
        timestamp,
        row_number() over (partition by id order by timestamp desc) as rownum
    from orders

    {% if is_incremental() %}
    -- if the table doesn't exist, then every order row will be processed.
    -- If it does, then only rows that have changed since the last invocation need to be processed.
    where exists (
        select null
        from {{ this }} as this
        where this.id = orders.id
        and orders.timestamp > this.timestamp
    )
    {% endif %}

),

final as (

    select
        id,
        region,
        timestamp
    from windowed
    where rownum = 1

)

select * from final
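For the deduplicated row to replace the previously loaded row for the same ID on later runs, you’d also typically give the model a unique_key so dbt merges/updates rather than appends. A minimal config sketch (adjust the strategy to whatever your adapter supports):

{{ config(
    materialized = 'incremental',
    unique_key = 'id'
) }}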
The article you linked refers to source tables existing or not, i.e. the raw untransformed tables created by your ETL tool.