Why are CI jobs ("state" method) building more/less models/nodes than expected?

Starting a topic to discuss scenarios where using the state:modified method (and it’s variation) are resulting dbt running more or less models (nodes) than one might expect.


Other helpful tips

How to do self-diagnosis

To understand how to debug dbt Cloud Slim CI runs yourself when you think “too many/few models are being built”, watch this quick loom explainer on how that works in general:

https://www.loom.com/share/445fb6372f8f4e669b1a6575d70c45f8

Finding more information in the dbt Cloud UI

The dbt Cloud runs UI gives you:

  • Which exact (Prod job) run the CI job run deferred to (since that deferred run is where you should be getting your manifest.json file from).
  • The exact commit shas of each run (in the example above, my CI job was run on commit 4f147a9 and my prod job was run on commit b85061d) - so you can look at your repository and do a comparsion between the 2 commits to see what has changed (see below).

Comparing differences between 2 commits

To compare what is different between 2 commits, we can use the following URL pattern:

https://github.com/<owner-name>/<repo-name>/compare/<sha-1>...<sha-2>

For GitLab:

https://gitlab.com/<repo-path>/-/compare/<sha-1>...<sha-2>

For Azure DevOps:

https://dev.azure.com/<organization>/<project>/_git/<repo-name>/branchCompare?baseVersion=GC<sha-1>&targetVersion=GC<sha-2>&_a=files

So for my example above, I would go to:

https://github.com/jeremyyeo/dbt-basic/compare/b85061d...4f147a9

Which then shows me exactly what is different between those 2 commits:

Tip: If you try and do this compare but see that there are no changes - you want to try and swap those sha’s around - i.e. if you tried /compare/abc...def then try the reverse instead /compare/def...abc.

1 Like

Source properties are different

Assuming we have a dbt project (profiles.yml) setup with 2 different target names:

# ~/.dbt/profiles.yml
snowflake:
  target: prod
  outputs:
    prod: &sf-creds
      type: snowflake
      ...
    ci: *sf-creds

# dbt_project.yml
name: my_dbt_project
profile: snowflake
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table

# models/sources.yml
# There is the same `users` source table here but two of it exists - one in each database.
version: 2
sources:
  - name: company_foo
    database: '{{ "development" if target.name == "prod" else "development_jyeo" }}'
    tables:
      - name: users
-- models/foo_1.sql
select * from {{ source('company_foo', 'users') }}

-- models/foo_2.sql
select * from {{ ref('foo_1') }}

-- models/bar_1.sql
select 1 as id

-- models/bar_2.sql
select * from {{ ref('bar_2') }}

Note that dbt Cloud also has target names configurable per dbt Cloud job. The job shown here has a target.name == 'ci':

  1. First lets generate a manifest.json of a production run - which will be deferred to later in our subsequent ci run. Be sure to move the generate artifacts to it’s own folder as well.
$ dbt ls --target prod
my_dbt_project.bar_1
my_dbt_project.bar_2
my_dbt_project.foo_1
my_dbt_project.foo_2
source:my_dbt_project.company_foo.users
$ mv target target_old

Some users do dbt compile instead which works too. Most commonly, this would be dbt run as production jobs are meant to be building models. In special circumstances do users want the job being deferred to to be an ls or compile job.

  1. Now let’s modify our model bar_1.sql but not anything else.
-- models/bar_1.sql
select 2 as id
  1. Let’s do a run with state deferral.
$ dbt run -s state:modified --defer --state target_old --target ci
22:51:46  Running with dbt=1.4.5
22:51:47  Found 4 models, 0 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
22:51:47  
22:51:52  Concurrency: 1 threads (target='ci')
22:51:52  
22:51:52  1 of 1 START sql table model dbt_jyeo.bar_1 .................................... [RUN]
22:51:57  1 of 1 OK created sql table model dbt_jyeo.bar_1 ............................... [SUCCESS 1 in 4.65s]
22:51:57  
22:51:57  Finished running 1 table model in 0 hours 0 minutes and 10.18 seconds (10.18s).
22:51:57  
22:51:57  Completed successfully
22:51:57  
22:51:57  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Note that dbt Cloud automatically does deferral to the job selected - we don’t need --state --defer with dbt Cloud. You also don’t need to specify --target since that would be set in the dbt Cloud job UI.

Here - there is no confusion - we modified model bar_1 and indeed it was the only one that was changed.

  1. Most of the time though, we also want to run models that are downstream since changes to bar_1, could also affect bar_2 - and to do this, we add a plus (+) to our selector. Let’s try that:
$ dbt run -s state:modified+ --defer --state target_old --target ci
22:56:37  Running with dbt=1.4.5
22:56:38  Found 4 models, 0 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
22:56:38  
22:56:44  Concurrency: 1 threads (target='ci')
22:56:44  
22:56:44  1 of 4 START sql table model dbt_jyeo.bar_1 .................................... [RUN]
22:56:49  1 of 4 OK created sql table model dbt_jyeo.bar_1 ............................... [SUCCESS 1 in 4.54s]
22:56:49  2 of 4 START sql table model dbt_jyeo.foo_1 .................................... [RUN]
22:56:53  2 of 4 OK created sql table model dbt_jyeo.foo_1 ............................... [SUCCESS 1 in 4.13s]
22:56:53  3 of 4 START sql table model dbt_jyeo.bar_2 .................................... [RUN]
22:56:57  3 of 4 OK created sql table model dbt_jyeo.bar_2 ............................... [SUCCESS 1 in 4.23s]
22:56:57  4 of 4 START sql table model dbt_jyeo.foo_2 .................................... [RUN]
22:57:02  4 of 4 OK created sql table model dbt_jyeo.foo_2 ............................... [SUCCESS 1 in 4.72s]
22:57:02  
22:57:02  Finished running 4 table models in 0 hours 0 minutes and 23.71 seconds (23.71s).
22:57:02  
22:57:02  Completed successfully
22:57:02  
22:57:02  Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4

Note we’re still deferring to the manifest.json generated earlier with dbt ls in step (1) and not the manifest.json generated in step (3).

Now, many more models have been executed. For bar_2 - it’s not surprising since bar_2 is directly downstream of bar_1. But why foo_1 (and subsequently foo_2)? Well, that’s due to how we’ve defined our source which foo_1 uses - recall:

# models/sources.yml
version: 2
sources:
  - name: company_foo
    database: '{{ "development" if target.name == "prod" else "development_jyeo" }}'
    tables:
      - name: users

In our production job (target == ‘prod’), the source evaluated to development.company_foo.users but in our ci job (target == ‘ci’), the source is instead development_jyeo.company_foo.users - this means that YES, the source is detected as being modified. But of course, even if a source is modified, run -s state:modifed doesn’t actually do ANYTHING to sources (since sources are not run). But as soon as we do run -s state:modified+ - we would be “running” things that are downstream of the changed source - which is of course foo_1 (and foo_2 by extension).

1 Like

Parent/child nodes ARE modified (whether on purpose or not) but have been excluded (via --exclude)

This is another scenario that may trip folks up as you have to not only understand “state” and the “plus” (+) graph operator but also it’s interactions with node exclusions at the same time.

Let’s setup a toy project like so (copied from the previous post):

# ~/.dbt/profiles.yml
snowflake:
  target: prod
  outputs:
    prod: &sf-creds
      type: snowflake
      ...
    ci: *sf-creds

# dbt_project.yml
name: my_dbt_project
profile: snowflake
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table

And some models like:

-- models/foo.sql
select 1 as id

-- models/bar.sql
select 1 as id

-- models/staging/stg.sql
{{ config(materialized='incremental', unique_key='id') }}
select * from {{ ref('foo') }}

We can assume for a second that models in the staging folder specifically - we don’t want them to be run during Slim CI jobs because they perhaps contain or process a lot of data (of course this is not the case with this toy example but just imagine it :slight_smile:) - we want to manually test them. What does this mean? This means that you most likely will be running CI jobs with a command that’s similar to:

dbt run --select ... --exclude staging

Replace ... above with some variation of state:modifed.


Okay, now let’s do our first “production” run.

$ dbt run --full-refresh
01:12:06  Running with dbt=1.4.5
01:12:07  Found 3 models, 0 tests, 0 snapshots, 0 analyses, 308 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
01:12:07  
01:12:13  Concurrency: 1 threads (target='default')
01:12:13  
01:12:13  1 of 3 START sql table model dbt_jyeo.bar ...................................... [RUN]
01:12:17  1 of 3 OK created sql table model dbt_jyeo.bar ................................. [SUCCESS 1 in 3.64s]
01:12:17  2 of 3 START sql table model dbt_jyeo.foo ...................................... [RUN]
01:12:20  2 of 3 OK created sql table model dbt_jyeo.foo ................................. [SUCCESS 1 in 3.52s]
01:12:20  3 of 3 START sql incremental model dbt_jyeo.stg ................................ [RUN]
01:12:24  3 of 3 OK created sql incremental model dbt_jyeo.stg ........................... [SUCCESS 1 in 4.00s]
01:12:24  
01:12:24  Finished running 2 table models, 1 incremental model in 0 hours 0 minutes and 16.96 seconds (16.96s).
01:12:24  
01:12:24  Completed successfully
01:12:24  
01:12:24  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Nothing surprising here - all 3 models are run as expected. Now let’s move our target folder as we did previously so we can defer to the manifest.json that was generated above.

$ mv target target_old

And then let’s make two changes.

(1) Let’s modify model bar:

-- models/bar.sql
select 2 as id

(2) Let’s modify our dbt_project.yml like so:

# dbt_project.yml
name: my_dbt_project
profile: snowflake
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table
    staging:
      +incremental_strategy: "delete+insert"
# ^ These last 2 lines are newly added to the file ^ #

Note, in very large projects with many models - you MAY not even know that by adding those 2 lines - you have inadvertently caused the incremental model stg to be “modified” since model configs can be set in many places (the model’s own config() block, in the dbt_project.yml file - like we did here, or even it’s property yml file).

Okay, now let’s do a CI run:

dbt run -s +state:modified+ --exclude models/staging --defer --state target_old

This command says to:

  1. Build anything that’s modified.
  2. Build anything that’s modified and all it’s child / downstream nodes.
  3. Build anything that’s modified and all it’s parent / upstream nodes.
  4. Do not build anything that is excluded - which is simply the model stg.

Let’s see what happens:

$ dbt run -s +state:modified+ --exclude models/staging --defer --state target_old
01:13:37  Running with dbt=1.4.5
01:13:39  Found 3 models, 0 tests, 0 snapshots, 0 analyses, 308 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
01:13:39  
01:13:45  Concurrency: 1 threads (target='default')
01:13:45  
01:13:45  1 of 2 START sql table model dbt_jyeo.bar ...................................... [RUN]
01:13:48  1 of 2 OK created sql table model dbt_jyeo.bar ................................. [SUCCESS 1 in 3.67s]
01:13:48  2 of 2 START sql table model dbt_jyeo.foo ...................................... [RUN]
01:13:52  2 of 2 OK created sql table model dbt_jyeo.foo ................................. [SUCCESS 1 in 3.25s]
01:13:52  
01:13:52  Finished running 2 table models in 0 hours 0 minutes and 12.75 seconds (12.75s).
01:13:52  
01:13:52  Completed successfully
01:13:52  
01:13:52  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

As we can see above - we didn’t modify foo but foo was included in the +state:modified+ selection. Why is that? Because we modified it’s child stg’s configuration (by adding an incremental_strategy config to it where one wasn’t there before). Thus stg is modified and upstream of stg is foo.

The exclusion --exclude models/staging simply means “exclude the node itself from running” - and that’s all it means and not more - as in - it does not mean “exclude the node itself from running plus any parent/child nodes from/of it”.

Snapshots schemas are dynamic based on the target instead of static.

If you’re using this common pattern Using dynamic schemas for snapshots to dynamically generate your snapshots target_schema, then the snapshot would be considered modified and thus, your snapshot and everything downstream of it will be executed (depending on your selection of course - i.e. the typical --select state:modified+ would).

There’s a call out to that directly in the linked discourse but I’m going to show a full working example here.

If we have a dbt project setup like so:

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table
    +schema: dbt_jyeo
-- macros/gsn.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' -%}
        dbt_jyeo_prod
    {%- elif target.name == 'ci' -%}
        dbt_jyeo_ci
    {%- else -%}
        {{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}

-- snapshots/snappy.sql
{% snapshot snappy %}
{{
    config(
        target_schema=generate_schema_name('dbt_jyeo'),
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

select 1 id, 'alice' as first_name
{% endsnapshot %}

-- models/foo.sql
select 1 id

-- models/from_snappy.sql
select * from {{ ref('snappy') }}

That is depending on the target, we want the snapshot to be built into a different schema.

First, let’s do a “production” run and then store the state (manifest.json) generated from the run:

$ dbt build --target prod
23:52:40  Running with dbt=1.5.6
23:52:41  Registered adapter: snowflake=1.5.3
23:52:42  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
23:52:42  
23:52:45  Concurrency: 1 threads (target='prod')
23:52:45  
23:52:45  1 of 3 START sql table model dbt_jyeo_prod.foo ................................. [RUN]
23:52:48  1 of 3 OK created sql table model dbt_jyeo_prod.foo ............................ [SUCCESS 1 in 2.53s]
23:52:48  2 of 3 START snapshot dbt_jyeo_prod.snappy ..................................... [RUN]
23:52:53  2 of 3 OK snapshotted dbt_jyeo_prod.snappy ..................................... [success in 5.16s]
23:52:53  3 of 3 START sql table model dbt_jyeo_prod.from_snappy ......................... [RUN]
23:52:55  3 of 3 OK created sql table model dbt_jyeo_prod.from_snappy .................... [SUCCESS 1 in 2.11s]
23:52:55  
23:52:55  Finished running 2 table models, 1 snapshot in 0 hours 0 minutes and 13.36 seconds (13.36s).
23:52:55  
23:52:55  Completed successfully
23:52:55  
23:52:55  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

$ mv target target_old

Next, let’s modify just model foo:

-- models/foo.sql
select 2 id

And then let’s build using “state:modified” and defer:

$ dbt build --target ci --select state:modified+ --defer --state target_old
23:54:18  Running with dbt=1.5.6
23:54:19  Registered adapter: snowflake=1.5.3
23:54:20  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
23:54:20  
23:54:24  Concurrency: 1 threads (target='ci')
23:54:24  
23:54:24  1 of 3 START sql table model dbt_jyeo_ci.foo ................................... [RUN]
23:54:26  1 of 3 OK created sql table model dbt_jyeo_ci.foo .............................. [SUCCESS 1 in 2.55s]
23:54:26  2 of 3 START snapshot dbt_jyeo_ci.snappy ....................................... [RUN]
23:54:32  2 of 3 OK snapshotted dbt_jyeo_ci.snappy ....................................... [success in 5.57s]
23:54:32  3 of 3 START sql table model dbt_jyeo_ci.from_snappy ........................... [RUN]
23:54:34  3 of 3 OK created sql table model dbt_jyeo_ci.from_snappy ...................... [SUCCESS 1 in 2.16s]
23:54:34  
23:54:34  Finished running 2 table models, 1 snapshot in 0 hours 0 minutes and 13.91 seconds (13.91s).
23:54:34  
23:54:34  Completed successfully
23:54:34  
23:54:34  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

So what we find is that we only modified foo.sql - however the snapshot and it’s downstream was also built. That is because of the conditional target_schema config on the snapshot, this has caused it to differ between the prod and ci runs (even though we did not modify it’s contents whatsoever) and because those snapshots are not the same, then “state:modified” necessarily includes the snapshot in the CI run.

Now, dbt is also able to detect env-aware configs if those configs are set in the dbt_project.yml file (Caveats to state comparison | dbt Developer Hub). Let’s try doing that.

Add the target_schema config to the dbt_project.yml file:

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table
    +schema: dbt_jyeo

snapshots:
  my_dbt_project:
    +target_schema: "{{ 'dbt_jyeo_ci' if target.name == 'ci' else 'dbt_jyeo_prod' }}"

And then remove it from the snapshots config block:

-- snapshots/snappy.sql
{% snapshot snappy %}
{{
    config(
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

select 1 id, 'alice' as first_name
{% endsnapshot %}

Keep in mind that we cannot do +target_schema: "{{ generate_schema_name('dbt_jyeo') }}" in the dbt_project.yml file itself because dbt doesn’t parse macros in that file so we have to put our business logic straight inline.

Let’s try and repeat the above exercise now that we’ve moved the config from the config block to the project file.

$ dbt build --target prod
23:58:29  Running with dbt=1.5.6
23:58:31  Registered adapter: snowflake=1.5.3
23:58:31  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
23:58:31  
23:58:35  Concurrency: 1 threads (target='prod')
23:58:35  
23:58:35  1 of 3 START sql table model dbt_jyeo_prod.foo ................................. [RUN]
23:58:37  1 of 3 OK created sql table model dbt_jyeo_prod.foo ............................ [SUCCESS 1 in 2.54s]
23:58:37  2 of 3 START snapshot dbt_jyeo_prod.snappy ..................................... [RUN]
23:58:43  2 of 3 OK snapshotted dbt_jyeo_prod.snappy ..................................... [success in 5.55s]
23:58:43  3 of 3 START sql table model dbt_jyeo_prod.from_snappy ......................... [RUN]
23:58:45  3 of 3 OK created sql table model dbt_jyeo_prod.from_snappy .................... [SUCCESS 1 in 2.09s]
23:58:45  
23:58:45  Finished running 2 table models, 1 snapshot in 0 hours 0 minutes and 13.79 seconds (13.79s).
23:58:45  
23:58:45  Completed successfully
23:58:45  
23:58:45  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

$ mv target target_old

Modify model foo.sql to be something different then do a ci run:

$ dbt build --target ci --select state:modified+ --defer --state target_old

00:00:42  Running with dbt=1.5.6
00:00:44  Registered adapter: snowflake=1.5.3
00:00:45  Found 2 models, 0 tests, 1 snapshot, 0 analyses, 323 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
00:00:45  
00:00:49  Concurrency: 1 threads (target='ci')
00:00:49  
00:00:49  1 of 1 START sql table model dbt_jyeo_ci.foo ................................... [RUN]
00:00:51  1 of 1 OK created sql table model dbt_jyeo_ci.foo .............................. [SUCCESS 1 in 2.37s]
00:00:51  
00:00:51  Finished running 1 table model in 0 hours 0 minutes and 6.07 seconds (6.07s).
00:00:51  
00:00:51  Completed successfully
00:00:51  
00:00:51  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Env-awareness of the target_schema config seems to have worked and only modified model foo as expected.