How do I specify a different schema for my source at run time?

Is it possible for me to specify a different schema at the DBT command line to override the one that is present in my source .yml file?

Here’s an example of what we have today:

sources:
  - name: IC
    description: "IC data from MySQL Prod"
    database: ic_reporting
    schema: icmysql

Currently all of my staging tables are referencing IC as a source. I’d like to be able to change the schema to icmysqlqa when I’m running this code via my DBT Cloud job that runs for our QA environment. For Production and Dev, I’d like for this value to be as is.

I also thought about creating another source called IC_qa but that would involve repeating the metadata I already have in the other source and that wouldn’t be good for maintenance.

Thanks
W

Hey @chanwd - this is a great question! I think you generally have two options here:

1. Use a var()

In this approach, you can supply a variable on the command line (or in your dbt Cloud step) which renders into the definition of the source. In practice, this might look like:

sources:
  - name: IC
    description: “IC data from MySQL Prod”
    database: ic_reporting
    schema: "{{ var('ic_schema', 'icmysql') }}"

This configuration sets up a dbt Variable called ic_schema with a default value of 'icmysql'. This value can be overridden on the command line using the --vars flag, eg:

$ dbt run --vars 'ic_schema: icmysqlqa'

When this variable is supplied, dbt will set the source schema for the IC dataset to be icmysqlqa instead of icmysql.

This is a pretty good approach if you have a single dataset where you want to toggle between dev and prod source datasets. It doesn’t scale quite as well if you have more than one such dataset, or if you want to introduce additional environment-specific datasets.

2. Use YAML Anchors

The YAML spec supports Anchors, which should work in dbt source definitions. Anchors (and aliases) could be a good way to maintain separate definitions for these sources without duplicating any attached metadata/tests.

In practice, this might look like:

version: 2

sources:
    - name: ic_prod
      schema: icmysql
      tables: &tables
          - name: table1
            description: "abc123"
          - name: table2
            description: "def456"

    - name: ic_qa
      schema: icmysqlqa
      tables: *tables

This config would create two difference sources with an identical set of tables. With this syntax, you can also overrides tests/descriptions for specific environments if necessary - check out the “Override values” section in the link shared above!

The big benefit here is flexibility: you can have a dedicated source for each dataset + environment pair. If you go this route, you can switch between them in you staging models, eg:

-- models/ic/stg_model.sql

select  *

{% if target.name == 'qa' %}
from {{ source('ic_qa', 'table_1') }}
{% else %}
from {{ source('ic', 'table_1') }}
{% endif %}

Let us know which of these you end up going with!

Thanks for taking the time to answer my question so thoroughly and promptly.

I opted for the 2nd option and created a 2nd source to represent my QA database information while also leveraging the aliases to avoid repeating information within the yml file.

I then created a macro that takes two parameters (schema and table) and called it db_source.

Something like this:

{% macro db_source(source_name, table_name) %}

{% if target.name == 'qa' %} 
{{ source(source_name + '_qa', table_name)}}
{% else %}
{{ source(source_name, table_name)}}
{% endif %}

{% endmacro %}

And then referencing it in all of my staging table SQL as:

from {{ db_source('ic', 'fct_table') }}

The one thing I do want to point out with the macro is that the expectation is that each of your sources are distinguished by a suffix, so QA should have a section with the same name as production but with _qa to distinguish the yml source name.

Something like this:

sources:
  - name: ic
  - name: ic_qa