Note — this article is intended for:
- anyone that writes dbt modeling packages (likely a consultant or vendor)
- OR anyone who likes seeing fancy things done with dbt
In other words: you probably don’t need this.
Scenario
Let’s say you’re a consultant that’s working with a number of merchants that use the same ecommerce software — Shopiary™. Your clients replicate the data into their warehouses using their favorite EL tool, and most of the time, the tables (and columns within the tables) are the same for each client — with one exception. Some stores implement a rewards program, and have a corresponding rewards_status
table, whereas other stores do not.
You decide to write a dbt package to re-use across your clients (and open source it so the community can benefit too ), but you need to figure out a way to handle the fact that sometimes the rewards_status
table exists, and sometimes it doesn’t.
Part 1: Figuring out if a table exists in your warehouse
As someone who keeps up with the latest dbt developments, you’re using sources in your dbt package. So you add the rewards_status
table to your shopiary
source:
version: 2
sources:
- name: shopiary
tables:
- name: customers
- name: orders
- name: payments
- name: rewards_status # we're not sure if this exists for all users of this package
An important note, it’s probably a good idea to remove any freshness
checks on this table (otherwise you’ll get an error):
...
- name: rewards_status
freshness: null
In your downstream model (say, stg_rewards_status
) you know the first thing you need to do is figure out a way to check if this table actually exists. If we naïvely write logic like this:
{% set table_exists=source('shopiary', 'rewards_status') is not none %}
{% if table_exists %}
{{ log("Table exists", info=True) }}
{% else %}
{{ log("Table does not exist", info=True) }}
{% endif %}
dbt will return table_exists=true
, since we’re actually checking for the existence of the Source in your dbt project, rather than the existence of the table in your database (this is very nuanced).
$ dbt compile
Running with dbt=0.17.1
Table exists
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources
15:03:57 | Concurrency: 1 threads (target='learn')
15:03:57 |
Table exists
15:03:58 | Done.
So we need to do something slightly different. We can use the get_relation
macro to instead check your database instead of your dbt project:
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'rewards_status').database,
schema=source('shopiary', 'rewards_status').schema,
identifier=source('shopiary', 'rewards_status').name) -%}
{% set table_exists=source_relation is not none %}
{% if table_exists %}
{{ log("Table exists", info=True) }}
{% else %}
{{ log("Table does not exist", info=True) }}
{% endif %}
$ dbt compile
Running with dbt=0.17.1
Table does not exist
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources
15:07:00 | Concurrency: 1 threads (target='learn')
15:07:00 |
Table does not exist
15:07:01 | Done.
Cool! We can set a variable that tells us “this table doesn’t exist”.
Let’s quickly check that it works for the positive case too (i.e. this isn’t a false negative) — we know customers
exists for everyone:
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'customers').database,
schema=source('shopiary', 'customers').schema,
identifier=source('shopiary', 'customers').name) -%}
{% set table_exists=source_relation is not none %}
{% if table_exists %}
{{ log("Table exists", info=True) }}
{% else %}
{{ log("Table does not exist", info=True) }}
{% endif %}
$ dbt compile
Running with dbt=0.17.1
Table does not exist
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources
15:08:09 | Concurrency: 1 threads (target='learn')
15:08:09 |
Table exists
15:08:10 | Done.
Huh, weird, it said it didn’t exist, and then it did exist… what gives?
Part 2: Understanding the difference between the parse
time and run time
When you execute dbt compile
or dbt run
, dbt parses your project twice:
-
First parse: dbt reads all of the files in your project and generates a “manifest” comprised of models, tests, and other graph nodes present in your project. During this phase, dbt uses the
ref
statements it finds to generate the DAG for your project. No SQL is run during this phase. - Second parse: dbt compiles (and runs) each node (eg. building models, or running tests). SQL is run during this phase
If you’ve ever used the execute
macro, you might have come across this behaviour
So, the reason customers “does not exist” and then suddenly “exists” is because:
-
First parse: no SQL run, so the
source_relation
gets returned asnone
, resulting in a false negative. -
Second parse: This time dbt actually runs SQL against the
information_schema
of your database to check if the table exists. Thesource_relation
gets returned when it exists.
OK, so maybe we can just use the results of the second parse to enable or model?
Part 3: “Enabling” and “disabling” models based on whether the table exists ()
It may seem like a reasonable way to handle this optional table is to use the enabled
config to turn models on and off based on the existence of the table.
This might look something like:
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'rewards_status').database,
schema=source('shopiary', 'rewards_status').schema,
identifier=source('shopiary', 'rewards_status').name) -%}
{% set table_exists=source_relation is not none %}
{{ config(enabled=table_exists) }}
select ...
Unfortunately, this is not possible:
- dbt needs to know if a model is enabled at parse time, and since we’re relying on querying the database to check if a model is exists (i.e run time), things are gonna get hairy!
- Even if there were a workaround (maybe there is — I gave up on this approach when I got to here ¯_(ツ)_/¯), this introduces complexity in any downstream models that
ref
thestg_rewards_status
model — you’d need to selectively enable/disable that part of your code
Part 4: Fake it 'til you make it
Fortunately, there is a cool workaround here!
- Check if the table exists, using some of the code we just wrote
- If it exists, write the transformation the way you intend to
- If it doesn’t, create a dummy, empty table instead with the exact same column names. Make sure you cast your
null
s to the right data-types in case they need to match datatype in downstream joins.
{# Not all stores have a concept of a rewards status, so it won't exist in every deployment #}
{%- set source_relation = adapter.get_relation(
database=source('shopiary', 'rewards_status').database,
schema=source('shopiary', 'rewards_status').schema,
identifier=source('shopiary', 'rewards_status').name) -%}
{% set table_exists=source_relation is not none %}
{% if table_exists %}
select
id as reward_id,
customer_id,
tier
from {{ source('shopiary', 'rewards_status') }}
{% else %}
select
null::integer as reward_id,
null::integer as customer_id,
null::integer as tier
-- this means there will be zero rows
where false
{% endif %}
Now, you can ref
the model, and join to the stg_rewards_status
table, just like any other model — null
values will just flow though instead of real values. If your client later starts syncing the rewards_status
table, the information will flow through as expected!
select
customers.customer_id,
customers.first_name,
customers.last_name,
rewards_status.tier as rewards_tier -- this will just be null if the table doesn't exist
from {{ ref('stg_customers')}} as customers
left join {{ ref('stg_rewards_status') }} as rewards_status using (customer_id)
Part 5: Other options!
There’s always more than one way to peel a potato. You could also investigate:
- Abstracting the
table_exists
logic into a macro — a great idea if there are multiple optional tables! - Creating the source table in the warehouse (this probably looks something like running a manual
create table
statement, using the role used by your EL tool). More complexity before getting to dbt, but keeps the package tidy. - Having a user defined variable in your
dbt_project.yml
for whether the models related torewards_status
are enabled or not, and leaving it up to your user to set it totrue
orfalse
. More configuration, but less code in your dbt project.