Fivetran: How to run models frequently enough but not too frequently

The problem I’m having

We’re hitting our database (Snowflake) far harder now that we’ve switched to dbt, resulting in extreme costs.
By default, Fivetran puts the + operator to the left of our model names so as to run upstream models as well. We need this in order to ensure dependencies are refreshed first.

This creates two problems:

  1. We’re re-running heavy models that are commonly hit. For example, we may have a heavy sales model that combines data from three tables into one denormalized table. This should only be run daily. But because our hourly models depend on it, they end up triggering it to be re-run every hour.
  2. We’re needlessly re-creating 1000+ views in our staging layer repeatedly throughout the day. Because every model at some point relies on a staging model, the staging model gets re-run many many times throughout the day. These staging models are simply views with very light transformations, so each one takes very little time. That said, due to the number of them, it becomes a problem. It creates no value for us to recreate these views, because their definitions are always the same (and since they’re views, their data is also always the same).

The context of why I’m trying to do this

We use the + operator in front of our models because most of our models require at least 2-3 other models before finally coming together in a “final” or “output” model.

What I’ve already tried

The only thing I can really think to do is to change from ref() to source() wherever we’re hitting something we don’t want to re-run frequently. But that feels inconsistent with dbt’s purpose and we therefore lose a lot of advantages of dbt by so doing.

I considered trying to put the + operator after more upstream models, but that won’t necessarily guarantee everything necessary gets run.

How do others do this with Fivetran? Even if we ignore the first part, how do others have a staging layer that isn’t constantly getting re-created for no reason?

I’ve tried looking at yaml selectors and applying a default property, but the docs state that they only apply on an “unqualified” command and will be ignored if any selectors are embedded in the command (which Fivetran always does).

For further context, Fivetran’s logs show that it always executes a command like the following:
dbt run --models +model_1 +model_2 +model_7

Hey @ptrn, just to clarify, are you using Fivetran’s Transformations for dbt Core product?

I haven’t used it (obviously we use dbt Cloud internally as well as at my last company, which has full control over the DAG selection when you configure a job), but at a guess: are you able to control the selectors at all, e.g. by managing the job as code instead?

Hey, Joel! I really appreciate your taking a look at this.

Yes, we’re using Fivetran’s Transformations for dbt Core (specifically, the “Scheduled in Fivetran” version) product.

We can switch to their “Scheduled in Code” version, which I’m going to try today. That’s the only way we can control the selectors.

I’m curious, though, what the best practice is to prevent re-running our staging layer, for example, every time even using selectors. Would we just always append something like --exclude staging on every single command?

Your best bet for now is to switch to “Scheduled in Code”. You lose the benefit of coordinating between syncs and dbt models, though. Our long-term goal is that the “Integrated Schedule” mode can handle all cases, but we’re not there yet.

1 Like

Hi @ptrn! As you were already advised to gain full control over selectors you might switch to “Scheduled in Code” transformations, but you also have an option to run your intermediate models with “Partially integrated” schedule. In this case it is possible to skip the unnecessary model runs while keeping your output models in “Fully integrated” mode running after upstream connectors. In any case I advise you to create a Fivetran support ticket and ask to route it to me, so the team could check your use case in depth.

Thanks for the idea.
Before I create a ticket to pursue this, I want to clarify: You’re saying that if I put an intermediate model on a “Partially Integrated” schedule, that other models in “Fully Integrated” mode which are dependent on this intermediate model will skip it when they execute?

I’m trying to work out the best way to do this using YAML Selectors.

I notice in the docs that it says:

exclude … is always applied last within its scope.

Let’s say I have the following lineage for an output model used by a Tableau report:

inventory --------------------->
int_products_enhanced --------->
                                 --> rpt_tableau
sales ------------------------->

And the following lineage used by my sales model:

int_products_enhanced ----->
stg_brick_mortar_sales ---->
                             ---> sales
stg_ecommerce_sales ------->

You’ll notice that int_products_enhanced is used in both transformations.
The sales model, along with all it’s parents, is the heavy model I want to exclude (which I run daily).

That said, because the lineage for the rpt_tableau model also directly references int_products_enhanced, I do want int_products_enhanced to be re-run when rpt_tableau and it’s parents are run.

Given that exclude selectors apply last and the following selectors.yml, int_products_enhanced doesn’t get re-run. Suggestions on how I could alter it so that it does get re-run due to it’s being directly referenced by rpt_tableau?

selectors:
    - name: reporting_layer_refresh_excluding_daily_and_staging
      definition:
        union:
          - method: fqn
            value: rpt_tableau
            parents: true
            children: false
          - exclude:
                - method: fqn
                  value: sales
                  parents: true
                  children: false

@ptrn here is more detailed description of Partially Integrated mode.

Hi,
Have you considered using incremental models ?
In your exemple, I would suggest creating an incremental model for the heavy sales table, so that its refresh is instantaneous when it was already run on a given day.
You could just filter the input data with the max(date) of your incremental model.

(I am no Fivetran user, so my answer my be a bit off)

1 Like