dbt snapshots and time-range joins
dbt Snapshots are the foundation of our data modeling layer. They are used to preserve the history of how data has changed over time. Capturing this history is invaluable for debugging and critical for certain business functions where we either need to track how an attribute changed that wasn’t captured in the source system, or when we need a particular report to not change as subsequent data updates happen (e.g., billing). However, snapshot data can be more difficult to work with, particularly when it comes to joining two different tables that have been subject to snapshotting.
To demonstrate the issue, we’ll consider a somewhat uninteresting case of having to join an “engagement” snapshot with a “partner” snapshot, where the names of the records were changed at different times, and we want to capture those name changes.
Given an engagements
snapshot that looks like (note that for practical purposes, the NULL has been replaced by actual date far in the future 2999-12-31 23:59:59
– aka, the Y3K bug):
engagement_sfid | partner_sfid | engagement_name | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|
eng1 | part1 | Engagement Old | 2021-10-08 00:01:01 | 2021-10-10 00:00:01 |
eng1 | part1 | Engagement New | 2021-10-10 00:00:01 | 2999-12-31 23:59:59 |
and a partners
snapshot that looks like:
partner_id | partner_name | dbt_valid_from | dbt_valid_to |
---|---|---|---|
part1 | Partner Old | 2021-10-08 00:00:01 | 2021-10-12 00:00:01 |
part1 | Partner New | 2021-10-12 00:00:01 | 2999-12-31 23:59:59 |
How do we join the two tables together to capture the combined record effective dates? Simply joining engagements.partner_sfid
to partners.partner_sfid
would give us the cartesian product of all the records, including a record where “Engagement Old” is paired with “Partner New”, which never actually happened because “Engagement Old” was not effective when “Partner Old” was changed to “Partner New”. What we want is a joined table that looks like:
engagement_sfid | partner_sfid | engagement_name | partner_name | valid_from | valid_to |
---|---|---|---|---|---|
eng1 | part1 | Engagement Old | Partner Old | 2021-10-08 00:01:01 | 2021-10-10 00:00:01 |
eng1 | part1 | Engagement New | Partner Old | 2021-10-10 00:00:01 | 2021-10-12 00:00:01 |
eng1 | part1 | Engagement New | Partner New | 2021-10-12 00:00:01 | 2999-12-31 23:59:59 |
Fortunately, most of this problem has already been solved, and I’ve implemented it in a dbt macro. To use it, you need to do some light data preparation of the tables involved as CTEs and then use the macro to specify the CTEs involved and how they join together:
WITH engagements AS (
SELECT
engagement_sfid,
engagement_name,
partner_sfid,
engagement_dotdotdot,
dbt_scd_id,
dbt_valid_from,
dbt_valid_to
FROM
{{ ref('engagements') }}
),
partners AS (
SELECT
partner_sfid,
partner_name,
partner_dotdotdot,
dbt_scd_id,
dbt_valid_from,
dbt_valid_to
FROM
{{ ref('partners') }}
),
{{
trange_join(
left_model='engagements',
left_fields=['engagement_sfid', 'engagement_name', 'partner_sfid', 'engagement_dotdotdot'],
left_primary_key='engagement_sfid',
right_models={
'partners': {
'fields': ['partner_sfid', 'partner_name', 'partner_dotdotdot'],
'left_on': 'partner_sfid',
'right_on': 'partner_sfid',
}
}
)
}},
SELECT * FROM trange_final