Note — this article is intended for:
- anyone that writes dbt modeling packages (likely a consultant or vendor)
- OR anyone who likes seeing fancy things done with dbt
In other words: you probably don’t need this.
Scenario
Let’s say you’re a consultant that’s working with a number of merchants that use the same ecommerce software — Shopiary™. Your clients replicate the data into their warehouses using their favorite EL tool, and most of the time, the tables (and columns within the tables) are the same for each client — with one exception. Some stores implement a rewards program, and have a corresponding rewards_status table, whereas other stores do not.
You decide to write a dbt package to re-use across your clients (and open source it so the community can benefit too
), but you need to figure out a way to handle the fact that sometimes the rewards_status table exists, and sometimes it doesn’t.
Part 1: Figuring out if a table exists in your warehouse
As someone who keeps up with the latest dbt developments, you’re using sources in your dbt package. So you add the rewards_status table to your shopiary source:
version: 2
sources:
- name: shopiary
tables:
- name: customers
- name: orders
- name: payments
- name: rewards_status # we're not sure if this exists for all users of this package
An important note, it’s probably a good idea to remove any freshness checks on this table (otherwise you’ll get an error):
...
- name: rewards_status
freshness: null
In your downstream model (say, stg_rewards_status) you know the first thing you need to do is figure out a way to check if this table actually exists. If we naïvely write logic like this:
{% set table_exists=source('shopiary', 'rewards_status') is not none %}
{% if table_exists %}
{{ log("Table exists", info=True) }}
{% else %}
{{ log("Table does not exist", info=True) }}
{% endif %}
dbt will return table_exists=true, since we’re actually checking for the existence of the Source in your dbt project, rather than the existence of the table in your database (this is very nuanced).
$ dbt compile
Running with dbt=0.17.1
Table exists
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources
15:03:57 | Concurrency: 1 threads (target='learn')
15:03:57 |
Table exists
15:03:58 | Done.
So we need to do something slightly different. We can use the get_relation macro to instead check your database instead of your dbt project:
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'rewards_status').database,
schema=source('shopiary', 'rewards_status').schema,
identifier=source('shopiary', 'rewards_status').name) -%}
{% set table_exists=source_relation is not none %}
{% if table_exists %}
{{ log("Table exists", info=True) }}
{% else %}
{{ log("Table does not exist", info=True) }}
{% endif %}
$ dbt compile
Running with dbt=0.17.1
Table does not exist
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources
15:07:00 | Concurrency: 1 threads (target='learn')
15:07:00 |
Table does not exist
15:07:01 | Done.
Cool! We can set a variable that tells us “this table doesn’t exist”.
Let’s quickly check that it works for the positive case too (i.e. this isn’t a false negative) — we know customers exists for everyone:
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'customers').database,
schema=source('shopiary', 'customers').schema,
identifier=source('shopiary', 'customers').name) -%}
{% set table_exists=source_relation is not none %}
{% if table_exists %}
{{ log("Table exists", info=True) }}
{% else %}
{{ log("Table does not exist", info=True) }}
{% endif %}
$ dbt compile
Running with dbt=0.17.1
Table does not exist
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources
15:08:09 | Concurrency: 1 threads (target='learn')
15:08:09 |
Table exists
15:08:10 | Done.
Huh, weird, it said it didn’t exist, and then it did exist… what gives?
Part 2: Understanding the difference between the parse time and run time
When you execute dbt compile or dbt run, dbt parses your project twice:
-
First parse: dbt reads all of the files in your project and generates a “manifest” comprised of models, tests, and other graph nodes present in your project. During this phase, dbt uses the
refstatements it finds to generate the DAG for your project. No SQL is run during this phase. - Second parse: dbt compiles (and runs) each node (eg. building models, or running tests). SQL is run during this phase
If you’ve ever used the execute macro, you might have come across this behaviour
So, the reason customers “does not exist” and then suddenly “exists” is because:
-
First parse: no SQL run, so the
source_relationgets returned asnone, resulting in a false negative. -
Second parse: This time dbt actually runs SQL against the
information_schemaof your database to check if the table exists. Thesource_relationgets returned when it exists.
OK, so maybe we can just use the results of the second parse to enable or model?
Part 3: “Enabling” and “disabling” models based on whether the table exists (
)
It may seem like a reasonable way to handle this optional table is to use the enabled config to turn models on and off based on the existence of the table.
This might look something like:
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'rewards_status').database,
schema=source('shopiary', 'rewards_status').schema,
identifier=source('shopiary', 'rewards_status').name) -%}
{% set table_exists=source_relation is not none %}
{{ config(enabled=table_exists) }}
select ...
Unfortunately, this is not possible:
- dbt needs to know if a model is enabled at parse time, and since we’re relying on querying the database to check if a model is exists (i.e run time), things are gonna get hairy!
- Even if there were a workaround (maybe there is — I gave up on this approach when I got to here ¯_(ツ)_/¯), this introduces complexity in any downstream models that
refthestg_rewards_statusmodel — you’d need to selectively enable/disable that part of your code
Part 4: Fake it 'til you make it
Fortunately, there is a cool workaround here!
- Check if the table exists, using some of the code we just wrote
- If it exists, write the transformation the way you intend to
- If it doesn’t, create a dummy, empty table instead with the exact same column names. Make sure you cast your
nulls to the right data-types in case they need to match datatype in downstream joins.
{# Not all stores have a concept of a rewards status, so it won't exist in every deployment #}
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'rewards_status').database,
schema=source('shopiary', 'rewards_status').schema,
identifier=source('shopiary', 'rewards_status').name) -%}
{% set table_exists=source_relation is not none %}
{% if table_exists %}
select
id as reward_id,
customer_id,
tier
from {{ source('shopiary', 'rewards_status') }}
{% else %}
select
null::integer as reward_id,
null::integer as customer_id,
null::integer as tier
-- this means there will be zero rows
where false
{% endif %}
Now, you can ref the model, and join to the stg_rewards_status table, just like any other model — null values will just flow though instead of real values. If your client later starts syncing the rewards_status table, the information will flow through as expected!
select
customers.customer_id,
customers.first_name,
customers.last_name,
rewards_status.tier as rewards_tier -- this will just be null if the table doesn't exist
from {{ ref('stg_customers')}} as customers
left join {{ ref('stg_rewards_status') }} as rewards_status using (customer_id)
![]()
Part 5: Other options!
There’s always more than one way to peel a potato. You could also investigate:
- Abstracting the
table_existslogic into a macro — a great idea if there are multiple optional tables! - Creating the source table in the warehouse (this probably looks something like running a manual
create tablestatement, using the role used by your EL tool). More complexity before getting to dbt, but keeps the package tidy. - Having a user defined variable in your
dbt_project.ymlfor whether the models related torewards_statusare enabled or not, and leaving it up to your user to set it totrueorfalse. More configuration, but less code in your dbt project.