Building dbt models to be compatible with multiple data warehouses

The best solution is no solution…

Even though this is not applicable to your use case, it needs to be said here (in case others are looking at this):
The best way to solve this problem is to avoid it! Unless you are trying to maintain parity across both warehouses, I strongly recommend using two separate dbt projects.

Further, I don’t recommend maintaining parity across two warehouses – “source of truth” becomes very messy quickly!

We sometimes see this need arise when someone is maintaining a legacy warehouse while transitioning to a new one. In this case, I’d also recommend not trying to share models across two warehouses, and instead trying to move your dbt project over in one go!

But, since it can’t be avoided in this case, it’s worth sharing some wisdom we’ve collected when building for compatibility across multiple warehouses :slight_smile: . This advice will also likely be useful for anyone trying to build packages!

Challenges when modeling on different warehouses

You’ll likely encounter the following challenges:

  1. The SQL syntaxes vary slightly across warehouses
  2. Your data may be loaded in a different structure
  3. A performant model design varies across warehouses
  4. Performance levers in dbt vary across warehouses

1. The SQL syntaxes vary slightly across warehouses

Why is this an issue?

Some of the non-ANSI functions vary across warehouses.
For example, on Redshift, to calculate the difference between two dates, you might do:

datediff('day', first_date, second_date)

But on BigQuery, you need to use a slightly different function, and reverse the order of your args :exploding_head:

datetime_diff(second_date, first_date, 'day')

And on postgres this function doesn’t exist at all. Bonkers!

Using dbt-utils / adapter macros for cross-database SQL functions

Fortunately, you can use macros to handle the logic of which SQL function you should use. Instead of writing the SQL, you can call a macro which returns the correct SQL for your warehouse, like this:

{{ datediff("first_date", "second_date", "'day'") }}

^ This assumes that you have a datediff macro that would return the correct SQL for your warehouse.

Even more fortunately, we’ve written a whole bunch of cross-database compatible macros in our dbt-utils package! (Just quietly, we think this is the best repository of knowledge around the syntactical differences between warehouses out there.)

You can leverage this work by installing dbt-utils as a package, and then using the utils functions whenever you have SQL that needs to differ across your warehouses.

If there is a specific SQL function that we haven’t written a macro for, we welcome PRs! Or you can also add your own macro to your project, following the adapter_macro pattern we use in dbt-utils.

One thing to note though is that there’s a trade-off to be aware of here: using macros can reduce the readability of your models, and quoting strings correctly can be confusing (see above), so be mindful of using them too much.

2. Your data may be loaded in a different structure

Why is this an issue?

If you use off-the-shelf products (Stitch, Fivetran) to load data into your warehouse, these services may load the data slightly differently. I have more experience with Stitch, so I’ll speak to the differences they have across warehouses as an example:

  • Stitch loads data into BQ in an “append only” fashion
  • Stitch unpacks/flattens JSON objects on Redshift (since Redshift doesn’t handle JSON well) but may not on other warehouses
  • Timestamp types might differ across warehouses

Using Jinja directly in your model

If a column name, or type, varies slightly across your two warehouses, I’d handle this with an if statement in your model:

select
{% if target.type == 'redshift' %}
    my_redshift_column as my_column,
{% elif target.type == 'bigquery' %}
    my_bq_column as my_column,
{% else %}
    {{ exceptions.raise_compiler_error(target.type ~" not supported in this project") }}
{% endif %}
    my_other_column

from my_table

On the Stitch front, @dylanbaker wrote some macros to handle the different way Stitch loads data in each warehouse, check them out here.

Using separate models with different enabled options

If there is a huge structural difference between the data that gets loaded into each of your warehouses, it may be worth splitting the SQL for the models into two separate files.

A handy tip here is that you can have two models with the same name, so long as only one of them is enabled. A pattern we’ve started using lately is to use a jinja variable to determine whether a model is enabled or not, as seen here.

You might create a directory of “staging” models for each warehouse (see this article for more details on how we structure projects):

├── dbt_project.yml
└── models
    └── staging
        └── braintree
            ├── bigquery
            |   └── stg_braintree__customers.sql
            └── redshift
                └── stg_braintree__customers.sql

Then in each model define the following [1]:

{{
    config(
        enabled = target.type == 'redshift'
    )
}}

Using adapter macros (not recommended)

Most of our packages use a pattern where each model calls a macro, and then the macro uses the adapter pattern to choose which SQL to use (for example, our Snowplow package).

However, we’ve found that this severely limits the readability of your code, so I’d advise against following this pattern. We are probably going to move away from it in future iterations of these packages!

3. A performant model design varies across warehouses

Credit to @josh for pointing this out.

Why is this an issue?

Different warehouses favor different model designs – for example, BigQuery pushes in favor of denormalization, so you should leverage nested objects rather than joins. Redshift doesn’t easily unnest objects, so they should be flattened first.

If you want to maintain parity between your two warehouses, you’ll have to choose which warehouse you’ll design for and make it the first class citizen, and then just squish your other warehouse to work with that data structure.

4. Performance levers in dbt vary across warehouses

Why is this an issue?

The way to make models performant varies:

  • Redshift uses sort and dist keys on tables
  • BigQuery uses date partitioning
  • Snowflake works the way you expect without any additional configuration

As such, the configs you supply to your dbt model will differ. For example, a Redshift model might have:

{{
    config(
        materialized='table',
        sort='created_at',
        dist='user_id',
    )
}}

Whereas a BigQuery model might have:

{{
    config(
        materialized='table',
        partition_by='date(created_at)',
        cluster_by='user_id'
    )
}}

Supplying all the configs to shared models

Since dbt doesn’t mind if you supply a config that it doesn’t recognize, I would just add all the things (i.e. performance configs):

{{
    config(
        materialized='table',
        sort='created_at',
        dist='user_id',
        partition_by='date(created_at)',
        cluster_by='user_id'
    )
}}

[1] In an ideal world, you would be able to enable your projects in your dbt_project.yml file:

  my_project:
    staging:
      braintree:
        redshift:
          enabled: "target.type == 'redshift'"
        bigquery:
          enabled: "target.type == 'bigquery'"

However, because when dbt parses the dbt_project.yml file it doesn’t yet know which profile to use (since the profile info is stored in that file!), it also doesn’t know about your target.type so won’t be able to parse this correctly.

There’s a related github issue, feel free to :+1: it!

2 Likes