Writing packages when a source table may or may not exist

Note — this article is intended for:

  • anyone that writes dbt modeling packages (likely a consultant or vendor)
  • OR anyone who likes seeing fancy things done with dbt :wink:

In other words: you probably don’t need this.

Scenario

Let’s say you’re a consultant that’s working with a number of merchants that use the same ecommerce software — Shopiary™. Your clients replicate the data into their warehouses using their favorite EL tool, and most of the time, the tables (and columns within the tables) are the same for each client — with one exception. Some stores implement a rewards program, and have a corresponding rewards_status table, whereas other stores do not.

You decide to write a dbt package to re-use across your clients (and open source it so the community can benefit too :wink: ), but you need to figure out a way to handle the fact that sometimes the rewards_status table exists, and sometimes it doesn’t.

Part 1: Figuring out if a table exists in your warehouse

As someone who keeps up with the latest dbt developments, you’re using sources in your dbt package. So you add the rewards_status table to your shopiary source:

version: 2

sources:
- name: shopiary
  tables:
    - name: customers
    - name: orders
    - name: payments
    - name: rewards_status # we're not sure if this exists for all users of this package
    

An important note, it’s probably a good idea to remove any freshness checks on this table (otherwise you’ll get an error):

    ...
    - name: rewards_status 
      freshness: null

In your downstream model (say, stg_rewards_status) you know the first thing you need to do is figure out a way to check if this table actually exists. If we naïvely write logic like this:

{% set table_exists=source('shopiary', 'rewards_status') is not none %}

{% if table_exists %}

{{ log("Table exists", info=True) }}

{% else %}

{{ log("Table does not exist", info=True) }}

{% endif %}

dbt will return table_exists=true, since we’re actually checking for the existence of the Source in your dbt project, rather than the existence of the table in your database (this is very nuanced).

$ dbt compile
Running with dbt=0.17.1
Table exists
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources

15:03:57 | Concurrency: 1 threads (target='learn')
15:03:57 |
Table exists
15:03:58 | Done.

So we need to do something slightly different. We can use the get_relation macro to instead check your database instead of your dbt project:

{%- set source_relation = adapter.get_relation(
      database=source('shopiary', 'rewards_status').database,
      schema=source('shopiary', 'rewards_status').schema,
      identifier=source('shopiary', 'rewards_status').name) -%}

{% set table_exists=source_relation is not none %}

{% if table_exists %}

{{ log("Table exists", info=True) }}

{% else %}

{{ log("Table does not exist", info=True) }}

{% endif %}

$ dbt compile
Running with dbt=0.17.1
Table does not exist
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources

15:07:00 | Concurrency: 1 threads (target='learn')
15:07:00 |
Table does not exist
15:07:01 | Done.

Cool! We can set a variable that tells us “this table doesn’t exist”.

Let’s quickly check that it works for the positive case too (i.e. this isn’t a false negative) — we know customers exists for everyone:

{%- set source_relation = adapter.get_relation(
      database=source('shopiary', 'customers').database,
      schema=source('shopiary', 'customers').schema,
      identifier=source('shopiary', 'customers').name) -%}

{% set table_exists=source_relation is not none  %}

{% if table_exists %}

{{ log("Table exists", info=True) }}

{% else %}

{{ log("Table does not exist", info=True) }}

{% endif %}

$ dbt compile
Running with dbt=0.17.1
Table does not exist
Found 6 models, 25 tests, 0 snapshots, 0 analyses, 268 macros, 0 operations, 0 seed files, 4 sources

15:08:09 | Concurrency: 1 threads (target='learn')
15:08:09 |
Table exists
15:08:10 | Done.

Huh, weird, it said it didn’t exist, and then it did exist… what gives?

Part 2: Understanding the difference between the parse time and run time

When you execute dbt compile or dbt run, dbt parses your project twice:

  • First parse: dbt reads all of the files in your project and generates a “manifest” comprised of models, tests, and other graph nodes present in your project. During this phase, dbt uses the ref statements it finds to generate the DAG for your project. No SQL is run during this phase.
  • Second parse: dbt compiles (and runs) each node (eg. building models, or running tests). SQL is run during this phase

If you’ve ever used the execute macro, you might have come across this behaviour

So, the reason customers “does not exist” and then suddenly “exists” is because:

  • First parse: no SQL run, so the source_relation gets returned as none, resulting in a false negative.
  • Second parse: This time dbt actually runs SQL against the information_schema of your database to check if the table exists. The source_relation gets returned when it exists.

OK, so maybe we can just use the results of the second parse to enable or model?

Part 3: “Enabling” and “disabling” models based on whether the table exists (:no_good_woman:)

It may seem like a reasonable way to handle this optional table is to use the enabled config to turn models on and off based on the existence of the table.

This might look something like:

{%- set source_relation = adapter.get_relation(
      database=source('shopiary', 'rewards_status').database,
      schema=source('shopiary', 'rewards_status').schema,
      identifier=source('shopiary', 'rewards_status').name) -%}

{% set table_exists=source_relation is not none  %}

{{ config(enabled=table_exists) }}

select ...

Unfortunately, this is not possible:

  • dbt needs to know if a model is enabled at parse time, and since we’re relying on querying the database to check if a model is exists (i.e run time), things are gonna get hairy!
  • Even if there were a workaround (maybe there is — I gave up on this approach when I got to here ¯_(ツ)_/¯), this introduces complexity in any downstream models that ref the stg_rewards_status model — you’d need to selectively enable/disable that part of your code :grimacing:

Part 4: Fake it 'til you make it :sunglasses:

Fortunately, there is a cool workaround here!

  • Check if the table exists, using some of the code we just wrote
  • If it exists, write the transformation the way you intend to
  • If it doesn’t, create a dummy, empty table instead with the exact same column names. Make sure you cast your nulls to the right data-types in case they need to match datatype in downstream joins.
{# Not all stores have a concept of a rewards status, so it won't exist in every deployment #}

{%- set source_relation = adapter.get_relation(
      database=source('shopiary', 'rewards_status').database,
      schema=source('shopiary', 'rewards_status').schema,
      identifier=source('shopiary', 'rewards_status').name) -%}

{% set table_exists=source_relation is not none   %}

{% if table_exists %}

select
    id as reward_id,
    customer_id,
    tier

from {{ source('shopiary', 'rewards_status') }}

{% else %}

select
    null::integer as reward_id,
    null::integer as customer_id,
    null::integer as tier

-- this means there will be zero rows
where false

{% endif %}

Now, you can ref the model, and join to the stg_rewards_status table, just like any other model — null values will just flow though instead of real values. If your client later starts syncing the rewards_status table, the information will flow through as expected!

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        rewards_status.tier as rewards_tier -- this will just be null if the table doesn't exist

    from {{ ref('stg_customers')}} as customers
    left join {{ ref('stg_rewards_status') }} as rewards_status using (customer_id)

:sunglasses: :sunglasses: :sunglasses:

Part 5: Other options!

There’s always more than one way to peel a potato. You could also investigate:

  • Abstracting the table_exists logic into a macro — a great idea if there are multiple optional tables!
  • Creating the source table in the warehouse (this probably looks something like running a manual create table statement, using the role used by your EL tool). More complexity before getting to dbt, but keeps the package tidy.
  • Having a user defined variable in your dbt_project.yml for whether the models related to rewards_status are enabled or not, and leaving it up to your user to set it to true or false. More configuration, but less code in your dbt project.
8 Likes

I had trouble applying this to external tables in redshift. Turns out, as of this posting, the get_relation call does not include external tables in redshift. There is an open issue for this https://github.com/fishtown-analytics/dbt/issues/2289 which includes a potential work around.