Why are CI jobs ("state" method) building more/less models/nodes than expected?

Snapshots schemas are dynamic based on the target instead of static.

If you’re using this common pattern Using dynamic schemas for snapshots to dynamically generate your snapshots target_schema, then the snapshot would be considered modified and thus, your snapshot and everything downstream of it will be executed (depending on your selection of course - i.e. the typical --select state:modified+ would).

There’s a call out to that directly in the linked discourse but I’m going to show a full working example here.

If we have a dbt project setup like so:

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table
    +schema: dbt_jyeo
-- macros/gsn.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' -%}
        dbt_jyeo_prod
    {%- elif target.name == 'ci' -%}
        dbt_jyeo_ci
    {%- else -%}
        {{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}

-- snapshots/snappy.sql
{% snapshot snappy %}
{{
    config(
        target_schema=generate_schema_name('dbt_jyeo'),
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

select 1 id, 'alice' as first_name
{% endsnapshot %}

-- models/foo.sql
select 1 id

-- models/from_snappy.sql
select * from {{ ref('snappy') }}

That is depending on the target, we want the snapshot to be built into a different schema.

First, let’s do a “production” run and then store the state (manifest.json) generated from the run:

$ dbt build --target prod
23:52:40  Running with dbt=1.5.6
23:52:41  Registered adapter: snowflake=1.5.3
23:52:42  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
23:52:42  
23:52:45  Concurrency: 1 threads (target='prod')
23:52:45  
23:52:45  1 of 3 START sql table model dbt_jyeo_prod.foo ................................. [RUN]
23:52:48  1 of 3 OK created sql table model dbt_jyeo_prod.foo ............................ [SUCCESS 1 in 2.53s]
23:52:48  2 of 3 START snapshot dbt_jyeo_prod.snappy ..................................... [RUN]
23:52:53  2 of 3 OK snapshotted dbt_jyeo_prod.snappy ..................................... [success in 5.16s]
23:52:53  3 of 3 START sql table model dbt_jyeo_prod.from_snappy ......................... [RUN]
23:52:55  3 of 3 OK created sql table model dbt_jyeo_prod.from_snappy .................... [SUCCESS 1 in 2.11s]
23:52:55  
23:52:55  Finished running 2 table models, 1 snapshot in 0 hours 0 minutes and 13.36 seconds (13.36s).
23:52:55  
23:52:55  Completed successfully
23:52:55  
23:52:55  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

$ mv target target_old

Next, let’s modify just model foo:

-- models/foo.sql
select 2 id

And then let’s build using “state:modified” and defer:

$ dbt build --target ci --select state:modified+ --defer --state target_old
23:54:18  Running with dbt=1.5.6
23:54:19  Registered adapter: snowflake=1.5.3
23:54:20  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
23:54:20  
23:54:24  Concurrency: 1 threads (target='ci')
23:54:24  
23:54:24  1 of 3 START sql table model dbt_jyeo_ci.foo ................................... [RUN]
23:54:26  1 of 3 OK created sql table model dbt_jyeo_ci.foo .............................. [SUCCESS 1 in 2.55s]
23:54:26  2 of 3 START snapshot dbt_jyeo_ci.snappy ....................................... [RUN]
23:54:32  2 of 3 OK snapshotted dbt_jyeo_ci.snappy ....................................... [success in 5.57s]
23:54:32  3 of 3 START sql table model dbt_jyeo_ci.from_snappy ........................... [RUN]
23:54:34  3 of 3 OK created sql table model dbt_jyeo_ci.from_snappy ...................... [SUCCESS 1 in 2.16s]
23:54:34  
23:54:34  Finished running 2 table models, 1 snapshot in 0 hours 0 minutes and 13.91 seconds (13.91s).
23:54:34  
23:54:34  Completed successfully
23:54:34  
23:54:34  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

So what we find is that we only modified foo.sql - however the snapshot and it’s downstream was also built. That is because of the conditional target_schema config on the snapshot, this has caused it to differ between the prod and ci runs (even though we did not modify it’s contents whatsoever) and because those snapshots are not the same, then “state:modified” necessarily includes the snapshot in the CI run.

Now, dbt is also able to detect env-aware configs if those configs are set in the dbt_project.yml file (Caveats to state comparison | dbt Developer Hub). Let’s try doing that.

Add the target_schema config to the dbt_project.yml file:

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table
    +schema: dbt_jyeo

snapshots:
  my_dbt_project:
    +target_schema: "{{ 'dbt_jyeo_ci' if target.name == 'ci' else 'dbt_jyeo_prod' }}"

And then remove it from the snapshots config block:

-- snapshots/snappy.sql
{% snapshot snappy %}
{{
    config(
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

select 1 id, 'alice' as first_name
{% endsnapshot %}

Keep in mind that we cannot do +target_schema: "{{ generate_schema_name('dbt_jyeo') }}" in the dbt_project.yml file itself because dbt doesn’t parse macros in that file so we have to put our business logic straight inline.

Let’s try and repeat the above exercise now that we’ve moved the config from the config block to the project file.

$ dbt build --target prod
23:58:29  Running with dbt=1.5.6
23:58:31  Registered adapter: snowflake=1.5.3
23:58:31  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
23:58:31  
23:58:35  Concurrency: 1 threads (target='prod')
23:58:35  
23:58:35  1 of 3 START sql table model dbt_jyeo_prod.foo ................................. [RUN]
23:58:37  1 of 3 OK created sql table model dbt_jyeo_prod.foo ............................ [SUCCESS 1 in 2.54s]
23:58:37  2 of 3 START snapshot dbt_jyeo_prod.snappy ..................................... [RUN]
23:58:43  2 of 3 OK snapshotted dbt_jyeo_prod.snappy ..................................... [success in 5.55s]
23:58:43  3 of 3 START sql table model dbt_jyeo_prod.from_snappy ......................... [RUN]
23:58:45  3 of 3 OK created sql table model dbt_jyeo_prod.from_snappy .................... [SUCCESS 1 in 2.09s]
23:58:45  
23:58:45  Finished running 2 table models, 1 snapshot in 0 hours 0 minutes and 13.79 seconds (13.79s).
23:58:45  
23:58:45  Completed successfully
23:58:45  
23:58:45  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

$ mv target target_old

Modify model foo.sql to be something different then do a ci run:

$ dbt build --target ci --select state:modified+ --defer --state target_old

00:00:42  Running with dbt=1.5.6
00:00:44  Registered adapter: snowflake=1.5.3
00:00:45  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
00:00:45  
00:00:49  Concurrency: 1 threads (target='ci')
00:00:49  
00:00:49  1 of 1 START sql table model dbt_jyeo_ci.foo ................................... [RUN]
00:00:51  1 of 1 OK created sql table model dbt_jyeo_ci.foo .............................. [SUCCESS 1 in 2.37s]
00:00:51  
00:00:51  Finished running 1 table model in 0 hours 0 minutes and 6.07 seconds (6.07s).
00:00:51  
00:00:51  Completed successfully
00:00:51  
00:00:51  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Env-awareness of the target_schema config seems to have worked and only modified model foo as expected.