Unioning identically-structured data sources

If you’re working with an app that has one database per client, or a company that has one Shopify store for each ecommerce store they manage, it’s often a good idea to union together each of these tables into one master table.

Here’s how I would do it for a client that has a separate Shopify store for each country of operation.

1. Use sources

version: 2

sources:
  - name: shopify_au
    tables: 
      - name: orders
      - name: customers

  - name: shopify_us
    tables: 
      - name: orders
      - name: customers

Then, if you want to get fancy, use yaml anchors to reduce duplicated code:

version: 2

sources:
  - name: shopify_au
    tables:  &shopify_tables
      - name: orders
      - name: customers

  - name: shopify_us
    tables: *shopify_tables

This extends pretty well when working with lots of countries! Unfortunately at present there’s no way to iterate over a list in your source definition.

2. Union together your sources in one layer

For each table, I’d create a model whose only function is to union together the different sources (not to do any renaming/recasting).

Why only union in this model? I am a big advocate for doing one logical thing in each model (when that logical thing is relatively easy to define), as it makes it easier to explain to someone what the model does. I like to think of things in terms of “contracts”/a guarantee to end users, which says “this model unions data only”, as I find these sorts of models are easier to maintain

In keeping with the way we structure out projects, I’d name this something like models/staging/shopify/union/union_shopify__orders.sql.

In an ideal world, where the tables have the exact same columns and the same column order, you can simply do something like this:

{% set countries =  ['au', 'us'] %}

{% for country in countries %}
  select 
      *,
      '{{ country }}' as country
  from {{ source(country, 'orders') %}
{% if not loop.last -%} union all {%- endif %}
{% endfor %}

:point_up: This has the advantage of being reasonably readable for someone who is still getting used to Jinja.

If your columns are in a strange order/some are missing from a table, you can use the union_relations macro to magic over that:

{% set countries =  ['au', 'us'] %}
{% set shopify_order_sources = [] %}
{% for country in countries %}
{% do shopify_order_sources.append(source('shopify_' ~ country, 'orders') %}
{% endfor %}

{{ dbt_utils.union_relations(country_order_sources) }}

OK, a little more complex, but you can still figure out what’s happening!

From here, you can choose how much macro-magic you want to apply. Some options:

  1. Use a var in your dbt_project.yml file to set the countries list
  2. Create a union_shopify_sources macros that has the logic for building the shopify_order_sources list, so you can call it like so (obviously name it appropriately for your use-case)
{{ union_shopify_sources(countries=['au', 'us'], table_name = 'orders') }}
  1. Parse the graph jinja variable (currently undocumented since it’s liable to change) to dynamically build the list of countries without having to manually define it. Here’s a hint:
{% macro get_sources(source_prefix, table_name) %}
    {% set sources = [] %}
    {% if execute %}
    {% for node_name, node in graph.nodes.items() %}
        {% if node.resource_type == 'source' and node.name == table_name and node.source_name.startswith(source_prefix) %}
            {% set new_source = source(node.source_name, node.name) %}
            {% do sources.append(new_source) %}
        {% endif %}
    {% endfor %}
    {% do return(sources) %}
    {% endif %}
{% endmacro %}

(This has the disadvantage of not rendering correctly in the docs)

A word of caution: All of these add complexity to your project — it’s much harder for someone who didn’t write this logic to understand exactly what is going on and make changes when required. Consider what’s right for your use case — if you are unioning 70 countries and 45 tables, it is totally appropriate to use macros, but if you have 2 countries and 3 tabes, go with the simple version first!

4 Likes

Excellent write-up!

Our team is taking a similar approach for making a master metric table that unions a bunch of staged metric tables together. We are using the dbt_utils.get_relations_by_prefix and dbt_utils.union_relations combination and found it very powerful and readable.

Is it possible to force this “master” model to run after the staged models? Our current approach is to run dbt twice, once with --models tag:staging and then --models tag:master. This doesn’t feel like the best approach, but it does allow us to generate the “master” model from successful staged models, regardless of failed staged models - which is desired.

Hi Joe,

So the get_relations_by_prefix macro returns a list of Relations (a view or table) in the warehouse, rather than a list of models. As a result, dbt can’t infer the dependencies between the upstream model and your master model.

There’s no real way to fix this, your workaround seems reasonable! The other option is to use subdirectories to group together your models and run them like --models path.to.subdirectory.*

Your idea is a good one though, check out this related issue:

1 Like