Why are CI jobs ("state" method) building more/fewer models/nodes than expected?

Starting a topic to discuss scenarios where using the state:modified method (and its variations) results in dbt running more or fewer models (nodes) than one might expect.


To understand how to debug dbt Cloud Slim CI runs yourself when you think “too many/few models are being built”, watch this quick loom explainer on how that works in general:

Source properties are different

Assume we have a dbt project whose profiles.yml is set up with 2 different target names:

# ~/.dbt/profiles.yml
snowflake:
  target: prod
  outputs:
    prod: &sf-creds
      type: snowflake
      ...
    ci: *sf-creds

# dbt_project.yml
name: my_dbt_project
profile: snowflake
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table

# models/sources.yml
# The same `users` source table exists twice: once in each database.
version: 2
sources:
  - name: company_foo
    database: '{{ "development" if target.name == "prod" else "development_jyeo" }}'
    tables:
      - name: users

-- models/foo_1.sql
select * from {{ source('company_foo', 'users') }}

-- models/foo_2.sql
select * from {{ ref('foo_1') }}

-- models/bar_1.sql
select 1 as id

-- models/bar_2.sql
select * from {{ ref('bar_1') }}

Note that dbt Cloud also lets you configure the target name per dbt Cloud job. The job shown here has target.name == 'ci':

  1. First, let’s generate a manifest.json from a production run, which our subsequent CI run will defer to. Be sure to move the generated artifacts to their own folder as well.
$ dbt ls --target prod
my_dbt_project.bar_1
my_dbt_project.bar_2
my_dbt_project.foo_1
my_dbt_project.foo_2
source:my_dbt_project.company_foo.users
$ mv target target_old

Some users run dbt compile instead, which works too. Most commonly, though, this would be dbt run, since production jobs are meant to build models. Only in special circumstances would users want the job being deferred to to be an ls or compile job.

  2. Now let’s modify our model bar_1.sql and nothing else.
-- models/bar_1.sql
select 2 as id
  3. Let’s do a run with state deferral.
$ dbt run -s state:modified --defer --state target_old --target ci
22:51:46  Running with dbt=1.4.5
22:51:47  Found 4 models, 0 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
22:51:47  
22:51:52  Concurrency: 1 threads (target='ci')
22:51:52  
22:51:52  1 of 1 START sql table model dbt_jyeo.bar_1 .................................... [RUN]
22:51:57  1 of 1 OK created sql table model dbt_jyeo.bar_1 ............................... [SUCCESS 1 in 4.65s]
22:51:57  
22:51:57  Finished running 1 table model in 0 hours 0 minutes and 10.18 seconds (10.18s).
22:51:57  
22:51:57  Completed successfully
22:51:57  
22:51:57  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Note that dbt Cloud automatically does deferral to the job selected - we don’t need --state --defer with dbt Cloud. You also don’t need to specify --target since that would be set in the dbt Cloud job UI.

There is no confusion here: we modified model bar_1, and it was indeed the only model that was run.

  4. Most of the time, though, we also want to run downstream models, since changes to bar_1 could also affect bar_2. To do this, we add a plus (+) to our selector. Let’s try that:
$ dbt run -s state:modified+ --defer --state target_old --target ci
22:56:37  Running with dbt=1.4.5
22:56:38  Found 4 models, 0 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
22:56:38  
22:56:44  Concurrency: 1 threads (target='ci')
22:56:44  
22:56:44  1 of 4 START sql table model dbt_jyeo.bar_1 .................................... [RUN]
22:56:49  1 of 4 OK created sql table model dbt_jyeo.bar_1 ............................... [SUCCESS 1 in 4.54s]
22:56:49  2 of 4 START sql table model dbt_jyeo.foo_1 .................................... [RUN]
22:56:53  2 of 4 OK created sql table model dbt_jyeo.foo_1 ............................... [SUCCESS 1 in 4.13s]
22:56:53  3 of 4 START sql table model dbt_jyeo.bar_2 .................................... [RUN]
22:56:57  3 of 4 OK created sql table model dbt_jyeo.bar_2 ............................... [SUCCESS 1 in 4.23s]
22:56:57  4 of 4 START sql table model dbt_jyeo.foo_2 .................................... [RUN]
22:57:02  4 of 4 OK created sql table model dbt_jyeo.foo_2 ............................... [SUCCESS 1 in 4.72s]
22:57:02  
22:57:02  Finished running 4 table models in 0 hours 0 minutes and 23.71 seconds (23.71s).
22:57:02  
22:57:02  Completed successfully
22:57:02  
22:57:02  Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4

Note we’re still deferring to the manifest.json generated earlier with dbt ls in step (1) and not the manifest.json generated in step (3).

Now, many more models have been executed. For bar_2 - it’s not surprising since bar_2 is directly downstream of bar_1. But why foo_1 (and subsequently foo_2)? Well, that’s due to how we’ve defined our source which foo_1 uses - recall:

# models/sources.yml
version: 2
sources:
  - name: company_foo
    database: '{{ "development" if target.name == "prod" else "development_jyeo" }}'
    tables:
      - name: users

In our production job (target.name == 'prod'), the source resolved to development.company_foo.users, but in our CI job (target.name == 'ci'), it resolved to development_jyeo.company_foo.users. This means that YES, the source is detected as modified. Of course, even when a source is modified, run -s state:modified doesn’t actually do ANYTHING with it, since sources are not run. But as soon as we do run -s state:modified+, we are “running” everything downstream of the changed source, which is foo_1 (and foo_2 by extension).
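To make the difference between state:modified and state:modified+ concrete, here is a rough Python sketch of the selection on this toy project. It is not how dbt implements selection internally; the graph is typed in by hand, and the names CHILDREN, with_descendants and runnable are made up for illustration.

```python
from collections import deque

# Toy DAG from this post, written by hand: parent -> list of direct children.
CHILDREN = {
    "source:company_foo.users": ["foo_1"],
    "foo_1": ["foo_2"],
    "foo_2": [],
    "bar_1": ["bar_2"],
    "bar_2": [],
}


def with_descendants(graph, start):
    """Return the start nodes plus everything downstream of them (the `+` suffix)."""
    seen = set(start)
    queue = deque(start)
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def runnable(nodes):
    """`dbt run` only builds models, so drop sources from the selection."""
    return sorted(n for n in nodes if not n.startswith("source:"))


# What the state comparison flags: the edited model AND the source whose
# database differs between the prod and ci targets.
modified = {"bar_1", "source:company_foo.users"}

print(runnable(modified))                              # ['bar_1']
print(runnable(with_descendants(CHILDREN, modified)))  # ['bar_1', 'bar_2', 'foo_1', 'foo_2']
```

The modified source is invisible in the first selection because it cannot be run, but it still pulls in its descendants once the plus is added.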

Parent/child nodes ARE modified (whether on purpose or not) but have been excluded (via --exclude)

This is another scenario that may trip folks up, as you have to understand not only “state” and the “plus” (+) graph operator but also their interaction with node exclusions at the same time.

Let’s set up a toy project like so (copied from the previous post):

# ~/.dbt/profiles.yml
snowflake:
  target: prod
  outputs:
    prod: &sf-creds
      type: snowflake
      ...
    ci: *sf-creds

# dbt_project.yml
name: my_dbt_project
profile: snowflake
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table

And some models like:

-- models/foo.sql
select 1 as id

-- models/bar.sql
select 1 as id

-- models/staging/stg.sql
{{ config(materialized='incremental', unique_key='id') }}
select * from {{ ref('foo') }}

Let’s assume that we don’t want models in the staging folder to run during Slim CI jobs, perhaps because they contain or process a lot of data (not the case in this toy example, but just imagine it), and we want to test them manually instead. This means you will most likely be running CI jobs with a command similar to:

dbt run --select ... --exclude staging

Replace ... above with some variation of state:modified.


Okay, now let’s do our first “production” run.

$ dbt run --full-refresh
01:12:06  Running with dbt=1.4.5
01:12:07  Found 3 models, 0 tests, 0 snapshots, 0 analyses, 308 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
01:12:07  
01:12:13  Concurrency: 1 threads (target='default')
01:12:13  
01:12:13  1 of 3 START sql table model dbt_jyeo.bar ...................................... [RUN]
01:12:17  1 of 3 OK created sql table model dbt_jyeo.bar ................................. [SUCCESS 1 in 3.64s]
01:12:17  2 of 3 START sql table model dbt_jyeo.foo ...................................... [RUN]
01:12:20  2 of 3 OK created sql table model dbt_jyeo.foo ................................. [SUCCESS 1 in 3.52s]
01:12:20  3 of 3 START sql incremental model dbt_jyeo.stg ................................ [RUN]
01:12:24  3 of 3 OK created sql incremental model dbt_jyeo.stg ........................... [SUCCESS 1 in 4.00s]
01:12:24  
01:12:24  Finished running 2 table models, 1 incremental model in 0 hours 0 minutes and 16.96 seconds (16.96s).
01:12:24  
01:12:24  Completed successfully
01:12:24  
01:12:24  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Nothing surprising here - all 3 models are run as expected. Now let’s move our target folder as we did previously so we can defer to the manifest.json that was generated above.

$ mv target target_old

And then let’s make two changes.

(1) Let’s modify model bar:

-- models/bar.sql
select 2 as id

(2) Let’s modify our dbt_project.yml like so:

# dbt_project.yml
name: my_dbt_project
profile: snowflake
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table
    staging:
      +incremental_strategy: "delete+insert"
# ^ These last 2 lines are newly added to the file ^ #

Note that in very large projects with many models, you MAY not even realize that adding those 2 lines has inadvertently caused the incremental model stg to be “modified”, since model configs can be set in many places (the model’s own config() block, the dbt_project.yml file as we did here, or even its properties yml file).
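If you suspect a config change like this one, one way to check is to compare the two manifest.json files yourself. The sketch below is a debugging aid only, not dbt's actual state comparison (which is more nuanced). The helper names load_manifest and diff_entries are made up, and the OLD / NEW dictionaries are hand-written, heavily trimmed stand-ins for the real manifests.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read a manifest.json file into a dict."""
    return json.loads(Path(path).read_text())


def diff_entries(old, new, section, fields):
    """List (unique_id, field, old_value, new_value) for entries in both manifests."""
    changes = []
    for unique_id, new_entry in new.get(section, {}).items():
        old_entry = old.get(section, {}).get(unique_id)
        if old_entry is None:
            continue  # brand-new node, nothing to compare against
        for field in fields:
            if old_entry.get(field) != new_entry.get(field):
                changes.append(
                    (unique_id, field, old_entry.get(field), new_entry.get(field))
                )
    return changes


# Hand-written, trimmed stand-ins for target_old/manifest.json and target/manifest.json.
OLD = {"nodes": {
    "model.my_dbt_project.foo": {"config": {"materialized": "table"}},
    "model.my_dbt_project.stg": {"config": {"materialized": "incremental",
                                            "unique_key": "id"}},
}}
NEW = {"nodes": {
    "model.my_dbt_project.foo": {"config": {"materialized": "table"}},
    "model.my_dbt_project.stg": {"config": {"materialized": "incremental",
                                            "unique_key": "id",
                                            "incremental_strategy": "delete+insert"}},
}}

changes = diff_entries(OLD, NEW, "nodes", ["config"])
for unique_id, field, before, after in changes:
    print(unique_id, field, before, "->", after)
```

Against real artifacts you would pass load_manifest("target_old/manifest.json") and load_manifest("target/manifest.json") instead of OLD and NEW. The same function with section "sources" and field "database" should surface the source difference from the first scenario.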

Okay, now let’s do a CI run:

dbt run -s +state:modified+ --exclude models/staging --defer --state target_old

This command says to:

  1. Build anything that’s modified.
  2. Build anything that’s modified and all its child / downstream nodes.
  3. Build anything that’s modified and all its parent / upstream nodes.
  4. Do not build anything that is excluded, which here is simply the model stg.

Let’s see what happens:

$ dbt run -s +state:modified+ --exclude models/staging --defer --state target_old
01:13:37  Running with dbt=1.4.5
01:13:39  Found 3 models, 0 tests, 0 snapshots, 0 analyses, 308 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
01:13:39  
01:13:45  Concurrency: 1 threads (target='default')
01:13:45  
01:13:45  1 of 2 START sql table model dbt_jyeo.bar ...................................... [RUN]
01:13:48  1 of 2 OK created sql table model dbt_jyeo.bar ................................. [SUCCESS 1 in 3.67s]
01:13:48  2 of 2 START sql table model dbt_jyeo.foo ...................................... [RUN]
01:13:52  2 of 2 OK created sql table model dbt_jyeo.foo ................................. [SUCCESS 1 in 3.25s]
01:13:52  
01:13:52  Finished running 2 table models in 0 hours 0 minutes and 12.75 seconds (12.75s).
01:13:52  
01:13:52  Completed successfully
01:13:52  
01:13:52  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

As we can see above, we didn’t modify foo, yet foo was included in the +state:modified+ selection. Why is that? Because we modified the configuration of its child stg (by adding an incremental_strategy config where there was none before). Thus stg is modified, and foo is upstream of stg.

The exclusion --exclude models/staging simply means “exclude the node itself from running” and nothing more. It does not mean “exclude the node itself plus any of its parent/child nodes from running”.
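The order of operations can be sketched in Python as: expand the modified set with its parents and children first, then drop the excluded nodes. Again, this is a simplified model of the behaviour seen in the log above, not dbt's real implementation. The graph and file paths are typed in by hand, and reach, invert and select are hypothetical names.

```python
# Toy project from this post, written by hand: node -> list of direct parents.
PARENTS = {"foo": [], "bar": [], "stg": ["foo"]}
PATHS = {
    "foo": "models/foo.sql",
    "bar": "models/bar.sql",
    "stg": "models/staging/stg.sql",
}


def invert(parents):
    """Turn a node -> parents mapping into a node -> children mapping."""
    children = {node: [] for node in parents}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children[parent].append(node)
    return children


def reach(graph, start):
    """Return the start nodes plus everything reachable from them in `graph`."""
    seen = set(start)
    stack = list(start)
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def select(modified, exclude_prefix):
    """Model `-s +state:modified+ --exclude <path>`: expand first, then exclude."""
    expanded = reach(PARENTS, modified) | reach(invert(PARENTS), modified)
    return sorted(n for n in expanded if not PATHS[n].startswith(exclude_prefix))


modified = {"bar", "stg"}  # bar's SQL changed; stg's config changed
print(select(modified, "models/staging"))  # ['bar', 'foo']
```

Because stg is only removed after the expansion, its parent foo stays in the selection, which matches the 2 models built in the run above.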