Hey @daniellang – welcome
A little late to the party here, but hopefully you’ll let me join.
My preferred approach looks like @warpraptor’s! We use stitch to replicate each database into a separate schema, then, in dbt, union them all together.
Here are some tips:
1. We define our raw data as sources.
A neat trick here is to use yaml anchors to reduce duplication of code:
version: 2
sources:
- name: shopify_us
tables: &shopify_tables
- name: products
columns:
- name: id
tests:
- unique
- not_null
- name: orders
columns:
- name: id
tests:
- unique
- not_null
- name: shopify_uk
tables: *shopify_tables
- name: shopify_au
tables: *shopify_tables
2. Use jinja in dbt to union the tables together
For our first version, we’ll often just union together the tables manually (using a little Jinja to help us out)
{%- set countries = ['us', 'uk', 'au'] -%}
with products as (
{% for country in countries %}
select * from {{ source('shopify_', ~ country, 'products') }}
{% if not loop.last %} union all {% endif %}
{% endfor %)
),
...
Then, inevitably, there will be some table that has the columns in a different order. So we’ll use the union_relations macro from dbt utils here to help us out:
{%- set countries = ['us', 'uk', 'au'] -%}
{%- set shopify_product_relations=[] -%}
{% for country in countries %}
{% do shopify_product_relations.append(source('shopify_' ~ country, 'products')) %}
{% endfor %}
with products as (
{{ dbt_utils.union_relations(shopify_product_relations) }}
),
...
As the number of tables we need to union grows, we end up repeating the first 6 lines of this across multiple models. So we might then abstract it into a macro. Here’s an example:
{% macro union_shopify_sources(table_name) %}
{%- set countries=['us', 'uk', 'au'] -%}
{%- set shopify_relations=[] -%}
{% for country in countries %}
{% do shopify_relations.append(source('shopify_' ~ country, table_name)) %}
{% endfor %}
{{ dbt_utils.union_relations(shopify_relations) }}
{% endmacro %}
Then our downstream model looks like:
with products as (
{{ union_shopify_sources('products') }}
),
...
Even though we’ve reduced the repetition in our code, there’s a tradeoff: we’ve obfuscated this code a little – it’s harder to understand exactly what’s happening without going to another code location.
3. Prefix unique IDs with country codes
My last tip for this kind of work is that I generally create a new unique ID for the unioned data by prefixing the old ID with a country code, e.g. US-187
, AU-87546
. I find that this is the most user-friendly version of the unique ID, and it allows me to write clean join
s downstream (since I also do this for foreign keys). In one project, I even went so far as to macro-fy this logic in my union_shopify_sources
, so that any field named id
or ending with _id
automatically had the country code prefixed.