I want to author a model in dbt that processes each row exactly once. A sample use case is where I have a function expensive_api_call()
that I want to run per row, but since the results of this API call will never change I want to ensure that I’m only calling it once per row, rather than once per materialisation. Intuitively, I want to find which rows haven’t yet been processed, process them, and store the result in the model.
I figured that an incremental model would be the appropriate way to do this, with something like my_model.sql
below:
{{ config(materialized='incremental', unique_key='my_id') }}
WITH missing_mappings AS (
SELECT my_id FROM {{ ref('my_source_table') }}
EXCEPT (SELECT my_id FROM {{ ref('my_model') }})
)
SELECT my_id, expensive_api_call(my_id) AS api_call_result
FROM missing_mappings
While the syntax of the code snippet above may not be exactly right, I need to resolve the modelling issue first: dbt (correctly) detects a cycle because the model is referring to itself.
I don’t have an intrinsic way of detecting which my_id
values are new and require the API call to be made (e.g. with a timestamp column in the upstream table), other than by inspecting which IDs have been previously processed in my_model
.
In case it’s relevant, my data warehouse is Snowflake and expensive_api_call()
is a stored procedure in Snowflake.
How can I achieve what I want with dbt?