If you’re working with an app that has one database per client, or a company that has one Shopify store for each ecommerce store they manage, it’s often a good idea to union together each of these tables into one master table.
Here’s how I would do it for a client that has a separate Shopify store for each country of operation.
1. Use sources
version: 2
sources:
- name: shopify_au
tables:
- name: orders
- name: customers
- name: shopify_us
tables:
- name: orders
- name: customers
Then, if you want to get fancy, use yaml anchors to reduce duplicated code:
version: 2
sources:
- name: shopify_au
tables: &shopify_tables
- name: orders
- name: customers
- name: shopify_us
tables: *shopify_tables
This extends pretty well when working with lots of countries! Unfortunately at present there’s no way to iterate over a list in your source definition.
2. Union together your sources in one model
For each table, I’d create a model whose only function is to union together the different sources (not to do any renaming/recasting).
Why only union in this model? I am a big advocate for doing one logical thing in each model (when that logical thing is relatively easy to define), as it makes it easier to explain to someone what the model does. I like to think of things in terms of “contracts”/a guarantee to end users, which says “this model unions data only”, as I find these sorts of models are easier to maintain
In keeping with the way we structure out projects, I’d name this something like models/staging/shopify/union/union_shopify__orders.sql
.
In an ideal world, where the tables have the exact same columns and the same column order, you can simply do something like this:
{% set countries = ['au', 'us'] %}
{% for country in countries %}
select
*,
'{{ country }}' as country
from {{ source(country, 'orders') }}
{% if not loop.last -%} union all {%- endif %}
{% endfor %}
This has the advantage of being reasonably readable for someone who is still getting used to Jinja.
If your columns are in a strange order/some are missing from a table, you can use the union_relations macro to magic over that:
{% set countries = ['au', 'us'] %}
{% set shopify_order_sources = [] %}
{% for country in countries %}
{% do shopify_order_sources.append(source('shopify_' ~ country, 'orders') %}
{% endfor %}
{{ dbt_utils.union_relations(shopify_order_sources) }}
OK, a little more complex, but you can still figure out what’s happening!
From here, you can choose how much macro-magic you want to apply. Some options:
- Use a
var
in yourdbt_project.yml
file to set thecountries
list - Create a
union_shopify_sources
macros that has the logic for building theshopify_order_sources
list, so you can call it like so (obviously name it appropriately for your use-case)
{{ union_shopify_sources(countries=['au', 'us'], table_name = 'orders') }}
- Parse the
graph
jinja variable (currently undocumented since it’s liable to change) to dynamically build the list of countries without having to manually define it. Here’s a hint:
{% macro get_sources(source_prefix, table_name) %}
{% set sources = [] %}
{% if execute %}
{% for node_name, node in graph.nodes.items() %}
{% if node.resource_type == 'source' and node.name == table_name and node.source_name.startswith(source_prefix) %}
{% set new_source = source(node.source_name, node.name) %}
{% do sources.append(new_source) %}
{% endif %}
{% endfor %}
{% do return(sources) %}
{% endif %}
{% endmacro %}
(This has the disadvantage of not rendering correctly in the docs)
A word of caution: All of these add complexity to your project — it’s much harder for someone who didn’t write this logic to understand exactly what is going on and make changes when required. Consider what’s right for your use case — if you are unioning 70 countries and 45 tables, it is totally appropriate to use macros, but if you have 2 countries and 3 tabes, go with the simple version first!
3. Construct new unique keys
If you’re using multiple shopify stores, and your source data uses incrementing keys, you’re going to end up with orders.id = 1
for multiple orders!
My preference is to construct new unique keys like AU-1
, US-1
instead: these keys are really easy to read and make debugging much easier than constructing a hashed key.
There’s a few different approaches we’ve taken over time:
- The layer after our
union_shopify__orders
model isstg_shopify__orders
, and we’ll write the SQL explictly there (again, way more readable)
select
upper(country) || '-' || id as order_id,
upper(country) || '-' || customer_id as customer_id,
...
- We might get a little fancy in our
union_shopify__orders
model with some Jinja directly in there:
{% set countries = ['au', 'us'] %}
{# assume all tables have the same columns, which is fine, because we are selecting * later based on this assumption #}
{% set columns = get_columns_in_relation(source('shopify_au', 'orders') %}
{% for country in countries %}
select
{% for col in columns %}
{% if col.name == 'id' or col.name[-3:] == '_id' %}
upper('{{ country }}') || '-' || {{ col.name }} as {{ col.name }},
{% else %}
{{ col.name }},
{% endif %}
{% endfor %}
'{{ country }}' as country
from {{ source(country, 'orders') }}
{% if not loop.last -%} union all {%- endif %}
{% endfor %}
- If we chose to create our own macro to union together our sources, we might embed that logic directly in the macro.
But what is we have hundreds of tables in each schema?
The above approach means you would need to create hundreds of models to create the union-ed version of the data (and each model would contain a call to a single macro).
At some point, this starts to break down. An alternative approach might be to use an operation to:
- List all the objects in a schema
- Iterate through that list and create a “super” view in a new schema, which unions all the countries together — the code will likely look similar to the above macro
- Create sources in your dbt project that point to the super view
- Run this operation at the start of any dbt job
With this approach, you lose the ability to view the lineage of the super view in your project’s docs (since lineage depends on using sources and refs), however it does declutter your project. FWIW, I think this approach is warranted when doing this at scale — if transformations are “mechanical” (i.e. easy to define expected input and output), operations can come in handy here