Testing incremental models

I love how dbt allows me to test schema and data :heart:, but there’s one thing I haven’t figured out - it’s testing of incremental models.

My typical CI test run looks like this

dbt deps --profiles-dir $DBT_PROFILE_DIR
dbt seed --profiles-dir $DBT_PROFILE_DIR
dbt run  --profiles-dir $DBT_PROFILE_DIR
dbt test --profiles-dir $DBT_PROFILE_DIR

That works well for the --full-refresh case, which is basically what CI does, since it always run in a clean environment and DB. But the --full-refresh case is rare in production and what runs every day is processing incremental changes, which is left untested.

Not sure if it’s relevant, but I use static data in my seeds to cover all edge cases. So the increment isn’t always “today”, but rather some static date in the past.

What would be a good approach to correctly test full + incremental transformations?

Cheers!

1 Like

One thing that I’ve found makes this testing easier is to tag each row to identify whether it was built by your full-refresh or by a subsequent incremental. You can do that with the is_incremental() function.

To make use of that, you can just add a column to each incremental table: “build_source”, like so:

{% if is_incremental() %}
‘incremental’
{% else %}
‘full’
{% endif %} as “build_source”,

When your model is built with --full-refresh (full build). and then without it (incrementally), the rows will either have ‘incremental’ or ‘full’ as the value of column “build_source”.

Once the column is populated, you can make use of that in your asserts and other test queries to see what the incremental did, like:

select * from inventory
where build_source = 'incremental'
order by inventory.item_id, updated_at

… and of course you can write more sophisticated queries to test different conditions explicitly.

1 Like

@terry_joyce that is a very useful suggestion, thanks! Can you advise how you implemented testing in CI? We want to run dbt twice - once for the build_source run and a second time for the incremental run, and the source data will need to be updated between runs

Hi Rob,

Since the table now contains the build_source column, it’s a simple matter to identify which records were touched by the incremental. You’d want to check that you are updating the records you intend to.

If your incremental is efficient, it will have little overlap: records that get re-written on multiple passes of the incremental. But it’s not always desirable to have zero overlap. Likely your incremental logic depends on one or more high water marks. These are often max values of timestamp columns. But depending on the nature of the source data, sometimes the timestamps you’re relying on are created a bit out of order, and if you had no overlap you’d miss some rows.

So that’s one of the things you’d test for: are you missing any records from the source data? That probably requires a comparative query of the transformed table and the source data, looking at a specific datetime range after your last full build.

You definitely want to run your schema tests on the updated models. Look for unexpected null values and non-unique keys where you expect them. That might tell you if you’re missing data, or if you’re duplicating data. Generally if your unique key is specified correctly, dbt will make sure you don’t duplicate data.

I also like to do some column-by-column compares of rows created by incremental vs. before the incremental ran.

Another aspect is efficiency. How long do these updates run per model? That you can just measure with the runtime information you’ll get from dbt cloud or your logs. It might take some experimentation to find the most efficient way to structure the dbt model to operate efficiently during incremental updates.

I’m sure there’s a lot more written on this subject of testing incrementals. It’s a big topic, and one to which a more specific answer will include “Depends on …”