Run a model in the production schema from the development environment

Hello,

We are on dbt Cloud. Quite often (especially for incremental models) I would like to run a model just once with a specific set of parameters that I don’t want to push into the main branch. I would like to test the dbt run in the development environment and then run the same code in production. Same, meaning that all the ref() functions and so on resolve to the production schema instead.

Is that possible somehow, please?

Can you explain why? dbt’s core premise is one of idempotency - when you run the same transformation you will get the same results. Why do you want some changes applied to your table only once, in a way that wouldn’t be reproduced if you rebuilt the model in the future?

As I mentioned, it’s an incremental model where I need to calculate the whole history. I’m building a daily customer snapshot: I loop day by day and UNION each day with the previous result. It’s a lot of data, and I don’t want to run the whole thing every day.
I know that I can create a project variable and use it to define the time span for which the data is calculated; I’m already doing this in another model. But even then I have to merge the branch into the repository, go to Deploy → Jobs, and update the settings of a manual job. I find this a bit redundant.
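What I do in that other model is roughly the following (the variable, column and model names here are only illustrative):

{%- set start_date = var('history_start_date', '2024-01-01') -%}

SELECT
    some_columns
FROM {{ ref('some_upstream_model') }}
WHERE event_date >= '{{ start_date }}'

The value can then be overridden per run with dbt run --vars '{"history_start_date": "2015-01-01"}', but in dbt Cloud that override lives in the job settings, which is exactly the part I keep having to edit.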
Isn’t there a way to run dbt run with a flag that tells it which schema to use?

Your instincts are correct - this doesn’t sound right. Can you share the code of the model you’re trying to build?

If your incremental model is set up in line with best practices, you shouldn’t need to be building it in development every day and then redeploying.

It should be something like

with upstream_model as (
  select * from {{ ref('upstream_model') }}
),

final as (
  select
    all,
    the,
    transformations,
    modified_date
  from upstream_model
  {% if is_incremental() %}
    -- on incremental runs, only pick up rows newer than what's already in the table
    where upstream_model.modified_date > (select max(this.modified_date) from {{ this }} as this)
  {% endif %}
)

select * from final

So that you’re only processing net-new data and you’re only processing it in the production environment.
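And when the logic genuinely changes (a new column, say), you can rebuild the whole table from that same code by running dbt run --full-refresh: during a full refresh is_incremental() returns false, so the filter is skipped and the entire history is reprocessed, with no hand-edited backfill query needed.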

There are Write-Audit-Publish paradigms which enable a blue/green deployment (e.g. Blue-Green Data Warehouse Deployments (Write-Audit-Publish) with BigQuery and dbt | Calogica.com), but even those tend to assume you’re promoting identical production resources as opposed to promoting your hand-crafted staging model.

The reason I’m pushing back here is that it sounds like you are a manual part of your orchestration process, by having to manually update the query and run it every time you want new data loaded; you shouldn’t need to be doing that.

I do the manual update only once in a while, for example when some logic changes or, most often, when a new column is added.

In most cases we try to follow best practices.

A short description of what this model should be doing:

  • creates a weekly customer table containing some specific, business-important metrics
  • each day, the current and the previous week (defined in mondays_query) are recalculated and a snapshot for each of those weeks is created
  • for a given week we always look at user activity on the web over the past ~90 days (see the WHERE clause)
  • I iterate over the weeks (2 in a standard run) and select from quite a large table; the result either inserts new rows into the destination table or overwrites existing ones
  • but once in a while I need to recalculate the whole history, and the compiled code then becomes quite large and I guess it uses a lot of compute resources…

Here is the code (shortened for readability):

-- depends_on: {{ ref('ga_events') }}

{{
    config(
        materialized = 'incremental',
        unique_key = ['week_monday', 'user_id']
    )
}}

{%- set mondays_query -%}
    SELECT
        DISTINCT CAST(DATE_TRUNC(generated_day, ISOWEEK) AS STRING) AS week_monday
    FROM UNNEST(GENERATE_DATE_ARRAY({{ get_week_start_date(-7) }}, {{ get_week_start_date() }})) as generated_day
{%- endset -%}

{%- set rows = run_query(mondays_query) -%}

{%- if execute -%}
    {%- set mondays_list = rows.columns[0].values() -%}
{%- else -%}
    {%- set mondays_list = [] -%}
{%- endif -%}

{% for monday in mondays_list %}

SELECT
    CAST(EXTRACT(YEAR FROM DATE_TRUNC('{{ monday }}', ISOWEEK)) AS STRING) || '_' || LPAD(CAST(EXTRACT(WEEK FROM DATE_TRUNC('{{ monday }}', ISOWEEK)) AS STRING), 2, '0') AS week,
    some_other_columns
FROM {{ ref('ga_events') }}
WHERE event_date BETWEEN DATE_ADD('{{ monday }}', INTERVAL -91 DAY) AND DATE_ADD('{{ monday }}', INTERVAL 6 DAY)
GROUP BY user_id

{% if not loop.last %} UNION ALL {% endif %}
{%- endfor %}

We don’t use dbt’s snapshot models because they are difficult to join; that’s why we build snapshots in this fashion.

@joellabes can I get an update on this topic, please?