Unioning identically-structured data sources

If you’re working with an app that has one database per client, or a company that runs a separate Shopify store for each region it operates in, it’s often a good idea to union each set of identically-structured tables into one master table.

Here’s how I would do it for a client that has a separate Shopify store for each country of operation.

1. Use sources

version: 2

sources:
  - name: shopify_au
    tables: 
      - name: orders
      - name: customers

  - name: shopify_us
    tables: 
      - name: orders
      - name: customers

Then, if you want to get fancy, use YAML anchors to reduce duplicated code:

version: 2

sources:
  - name: shopify_au
    tables: &shopify_tables
      - name: orders
      - name: customers

  - name: shopify_us
    tables: *shopify_tables

This extends pretty well when working with lots of countries! Unfortunately at present there’s no way to iterate over a list in your source definition.

2. Union together your sources in one model

For each table, I’d create a model whose only function is to union together the different sources (not to do any renaming/recasting).

Why only union in this model? I’m a big advocate for doing one logical thing in each model (when that logical thing is relatively easy to define), as it makes it easier to explain to someone what the model does. I like to think in terms of “contracts”: a guarantee to end users that says “this model unions data only”. I find these sorts of models are easier to maintain.

In keeping with the way we structure our projects, I’d name this something like models/staging/shopify/union/union_shopify__orders.sql.

In an ideal world, where the tables have the exact same columns and the same column order, you can simply do something like this:

{% set countries = ['au', 'us'] %}

{% for country in countries %}
  select 
      *,
      '{{ country }}' as country
  from {{ source('shopify_' ~ country, 'orders') }}
{% if not loop.last -%} union all {%- endif %}
{% endfor %}

:point_up: This has the advantage of being reasonably readable for someone who is still getting used to Jinja.

If your columns are in a strange order, or some are missing from one of the tables, you can use the union_relations macro from dbt_utils to smooth over that:

{% set countries = ['au', 'us'] %}
{% set shopify_order_sources = [] %}
{% for country in countries %}
{% do shopify_order_sources.append(source('shopify_' ~ country, 'orders')) %}
{% endfor %}

{{ dbt_utils.union_relations(shopify_order_sources) }}

OK, a little more complex, but you can still figure out what’s happening!
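
One nice side effect (per dbt_utils’ documented behaviour): union_relations also adds a _dbt_source_relation column recording which relation each row came from, so you may not even need to add a country column by hand.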

From here, you can choose how much macro-magic you want to apply. Some options:

  1. Use a var in your dbt_project.yml file to set the countries list (see the sketch after this list).
  2. Create a union_shopify_sources macro that contains the logic for building the shopify_order_sources list, so you can call it like so (obviously, name it appropriately for your use case):
{{ union_shopify_sources(countries=['au', 'us'], table_name='orders') }}
  3. Parse the graph Jinja variable (currently undocumented since it’s liable to change) to dynamically build the list of countries without having to manually define it. Here’s a hint:
{% macro get_sources(source_prefix, table_name) %}
    {% set sources = [] %}
    {% if execute %}
        {% for node_name, node in graph.nodes.items() %}
            {% if node.resource_type == 'source' and node.name == table_name and node.source_name.startswith(source_prefix) %}
                {% do sources.append(source(node.source_name, node.name)) %}
            {% endif %}
        {% endfor %}
    {% endif %}
    {# return outside the execute guard so the macro always returns a list #}
    {% do return(sources) %}
{% endmacro %}

(This has the disadvantage of not rendering correctly in the docs)
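
To flesh out option 1 from the list above, a minimal sketch (the variable name shopify_countries is just an example):

# in dbt_project.yml
vars:
  shopify_countries: ['au', 'us']

Then each union model picks the list up with var():

{% set countries = var('shopify_countries') %}

This gives you one place to update when a new country launches.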

A word of caution: all of these options add complexity to your project, and it’s much harder for someone who didn’t write this logic to understand exactly what is going on and to make changes when required. Consider what’s right for your use case: if you are unioning 70 countries and 45 tables, it is totally appropriate to use macros, but if you have 2 countries and 3 tables, go with the simple version first!

3. Construct new unique keys

If you’re using multiple shopify stores, and your source data uses incrementing keys, you’re going to end up with orders.id = 1 for multiple orders!

My preference is to construct new unique keys like AU-1 and US-1 instead: these keys are easy to read, which makes debugging much easier than it would be with a hashed key.
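
For comparison, the hashed alternative would look something like this, using the surrogate_key macro from dbt_utils (renamed generate_surrogate_key in later versions):

select
  {{ dbt_utils.surrogate_key(['country', 'id']) }} as order_id,
  ...

It works, but when an order_id shows up in a bug report, AU-1 tells you where to look; a hash doesn’t.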

There are a few different approaches we’ve taken over time:

  • The layer after our union_shopify__orders model is stg_shopify__orders, and we’ll write the SQL explicitly there (again, way more readable):
select
  upper(country) || '-' || id as order_id,
  upper(country) || '-' || customer_id as customer_id,
  ...

  • We might get a little fancy in our union_shopify__orders model with some Jinja directly in there:
{% set countries = ['au', 'us'] %}
{# assume all tables have the same columns as the au table; the select below is built from this column list on that assumption #}
{% set columns = adapter.get_columns_in_relation(source('shopify_au', 'orders')) %}

{% for country in countries %}
  select 
      {% for col in columns %}
      {% if col.name == 'id' or col.name[-3:] == '_id' %}
      upper('{{ country }}') || '-' || {{ col.name }} as {{ col.name }},
      {% else %}
      {{ col.name }},
      {% endif %}
      {% endfor %}
      '{{ country }}' as country
  from {{ source('shopify_' ~ country, 'orders') }}
{% if not loop.last -%} union all {%- endif %}
{% endfor %}


  • If we chose to create our own macro to union together our sources, we might embed that logic directly in the macro.
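
For illustration, here’s a minimal sketch of what that combined macro might look like, reusing the Jinja from above (the macro name and arguments follow the earlier example, but are otherwise an assumption):

{% macro union_shopify_sources(countries, table_name) %}

{# assumes all tables share the columns of the first country's table #}
{%- set columns = adapter.get_columns_in_relation(source('shopify_' ~ countries[0], table_name)) -%}

{% for country in countries %}
  select
      {% for col in columns %}
      {% if col.name == 'id' or col.name[-3:] == '_id' %}
      upper('{{ country }}') || '-' || {{ col.name }} as {{ col.name }},
      {% else %}
      {{ col.name }},
      {% endif %}
      {% endfor %}
      '{{ country }}' as country
  from {{ source('shopify_' ~ country, table_name) }}
{% if not loop.last %} union all {% endif %}
{% endfor %}

{% endmacro %}

Each union model then collapses to a single call, e.g. {{ union_shopify_sources(countries=['au', 'us'], table_name='orders') }}.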

But what if we have hundreds of tables in each schema?

The above approach means you would need to create hundreds of models just to create the unioned version of the data (and each model would contain nothing but a call to a single macro).

At some point, this starts to break down. An alternative approach might be to use an operation to:

  1. List all the objects in a schema
  2. Iterate through that list and create a “super” view in a new schema, which unions all the countries together — the code will likely look similar to the above macro
  3. Create sources in your dbt project that point to the super view
  4. Run this operation at the start of any dbt job

With this approach, you lose the ability to view the lineage of the super view in your project’s docs (since lineage depends on using sources and refs), but it does declutter your project. FWIW, I think this approach is warranted when doing this at scale: if transformations are “mechanical” (i.e. easy to define the expected input and output), operations can come in handy :slight_smile:
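
A hedged sketch of what such an operation might look like (every name here is illustrative, and it assumes each per-country schema contains the same tables):

{% macro create_union_views(schemas, target_schema) %}
    {# 1. list all the objects in the first schema #}
    {% set tables_query %}
        select table_name
        from information_schema.tables
        where lower(table_schema) = '{{ schemas[0] }}'
    {% endset %}
    {% set tables = run_query(tables_query).columns[0].values() %}

    {# 2. create a "super" view per table, unioning all countries #}
    {% for table in tables %}
        {% set ddl %}
            create or replace view {{ target_schema }}.{{ table }} as
            {% for schema in schemas %}
            select *, '{{ schema }}' as source_schema
            from {{ schema }}.{{ table }}
            {% if not loop.last %} union all {% endif %}
            {% endfor %}
        {% endset %}
        {% do run_query(ddl) %}
    {% endfor %}
{% endmacro %}

You’d then run it at the start of each job:

dbt run-operation create_union_views --args '{schemas: [shopify_au, shopify_us], target_schema: shopify_unioned}'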

Excellent write-up!

Our team is taking a similar approach for making a master metric table that unions a bunch of staged metric tables together. We are using the dbt_utils.get_relations_by_prefix and dbt_utils.union_relations combination and found it very powerful and readable.

Is it possible to force this “master” model to run after the staged models? Our current approach is to run dbt twice: once with --models tag:staging and then with --models tag:master. This doesn’t feel like the best approach, but it does allow us to generate the “master” model from the successful staged models regardless of any failed ones, which is desired.

Hi Joe,

So the get_relations_by_prefix macro returns a list of Relations (views or tables) in the warehouse, rather than a list of models. As a result, dbt can’t infer the dependencies between the upstream models and your master model.

There’s no real way to fix this; your workaround seems reasonable! The other option is to use subdirectories to group your models together and run them with --models path.to.subdirectory.*

Your idea is a good one though, check out this related issue:

Hello,

These approaches have been very helpful for our team, as we deal with a growing multi-source system. One question: we have some unique sources where YAML overrides would help keep the config clean and readable. Does anyone know if that feature is supported in dbt? We’ve tried unsuccessfully to implement it…

YAML overrides are indeed supported in dbt! Feel free to join us in Slack and share your question there — I think we’d need some more info to help debug here (code snippets etc.)
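
For anyone who finds this later, a source override looks roughly like this (assuming you’re overriding a source that ships in a package named shopify_source):

version: 2

sources:
  - name: shopify
    overrides: shopify_source
    schema: custom_shopify_schema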

Hi Claire,

This is very inspiring!
Can you shed some light on how to do a left join ... using ... instead of union all in a similar use case?

Thanks

It turned out that I figured out how to use left join (for the case where different tables have different columns that need to be consolidated into one):

{% set selected_list = var('selected_lists') %}

select *
{% for model in selected_list %}
  {% if loop.first %}
  from {{ ref(model) }}
  {% else %}
  left outer join {{ ref(model) }}
    using(provider_key, contract_class)
  {% endif %}
{% endfor %}

@s.chaser you could also check out union_relations from dbt_utils

A question: if I have three variables that need to loop, is it possible to have three loops in the select section, one for each variable? I have the three variables below, and I want three {% for ... in ... %} loops in the select section. I tried putting all three variables at the top and adding three loops, but only the first variable’s loop works; the other two don’t. Any ideas?
{% set ledger_type = ['AA','E1','AJ','AU','BA','1F','2F','3F','4F','1B','2B','3B','4B','JA','PA','HA','JY','EB','F1','F2','F3','F4','B1','B2','B3','B4'] %}

{% set acct_cat_code = ['1', '2', '3', '7', '4', '5', '6', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24'] %}

{% set bu_cat_code = ['01', '02', '03', '04', '05', '07', '09', '10', '11', '12', '13', '14', '15', '18', '19', '20', '21', '22', '23', '06', '08', '16', '17', '25', '29', '30', '31', '24', '26', '27', '28'] %}

Hi Joel, is there a way, using union_relations, to avoid duplicates based on the “Primary_key” fields that I have on both… tables?

A bit of an update:
The free Fivetran dbt package has a nice wrapper around get_relation and union_relations for this task.

fivetran_utils.union_data

Example:

Model:

{{ config(materialized='table') }}

{{
    fivetran_utils.union_data(
        table_identifier='T1',
        default_schema='public'
    )
}}

Execution variant:
C:\Users\serge\venvs\srtd\Scripts\dbt.exe run --vars "union_databases: ['raw','analytics']"

Generated code:

(
    select
        cast('"RAW"."PUBLIC"."T1"' as varchar) as _dbt_source_relation,
        cast("VL" as character varying(3)) as "VL"
    from "RAW"."PUBLIC"."T1"
)

union all

(
    select
        cast('"ANALYTICS"."PUBLIC"."T1"' as varchar) as _dbt_source_relation,
        cast("VL" as character varying(3)) as "VL"
    from "ANALYTICS"."PUBLIC"."T1"
)

Slack discussion:
https://getdbt.slack.com/archives/CBSQTAPLG/p1657208903142319?thread_ts=1656702086.994559&cid=CBSQTAPLG

Hi Claire! What if my final union statement is too large for my Redshift cluster to process? Is there a solution for that? Something like incremental inserting, looping by schema name?

Hi Claire,
Thanks for the brilliant explanation and example! Really helpful!

I have tables that are ingested by Fivetran weekly, and the names of the tables are stored in a table named schema_name.fivetran_audit, so I can get the list for the {% set %} part. But since this comes back as a table, I’m replacing the hardcoded list (['au', 'us'] in your example) with a run_query() call. The problem is that I’m getting an error that is not very helpful to me:

 syntax error at end of input
    LINE 4:
              ^

Do you have any idea how I could tackle this error? ChatGPT hasn’t been much help so far either…