I’d like to setup my fact tables as incremental, but am not sure how to make sure they load correctly from multiple sources / refs.
Say I have a fact_a , which pulls from ref_b and ref_c. I’d like fact_a to refresh in cases where eitherref_b or ref_c has been updated according to my incremental logic, but can’t figure out how to do this correctly.
An example of what I want:
ref_b table
ref_b_id
ref_b_value
ref_b_updated_at
1
abcdef
‘2023-05-30’
2
ghijk
‘2023-05-15’
3
lmno
‘2023-04-30’
ref_c table
ref_c_id
ref_b_id
ref_c_value
ref_c_updated_at
1
1
opqr
‘2023-05-30’
2
2
stuv
‘2023-05-30’
3
4
wxyz
‘2023-04-30’
Let’s say we join on ref_b_id, which is the unique key for the fact_a table. If I’m generating incremental rows for fact_a filtering on updated_at > '2023-05-29', I want the new rows to be inserted to be:
fact_a
ref_b_id
ref_b_value
ref_c_id
ref_c_value
updated_at
1
abcdef
1
opqr
2023-05-30
3
lmno
2
stuv
2023-05-30
We’ve added the rows where the source records were recently updated in either ref_b or ref_c.
I feel like this should be pretty straightforward but for some reason I’m having a hard time finding examples of others doing the same. All the examples I’ve seen only show joining on refs where both / all of the refs have been updated within the time window. We have fact tables with quite a few sources so ideally this would work across more than 2 tables
edit: I think I figured out how to do it using UNION. Not sure how fast this will be but seems logically correct to me, using actual tables from my org:
with users_count as (
SELECT
cu.owned_by_organization_id,
count(*) as user_count,
max(cu.date_updated) as user_date_updated
FROM
user_user cu
group by cu.owned_by_organization_id
),
sites_count as (
SELECT
si.owned_by_organization_id,
count(*) as site_count,
max(si.date_updated) as site_date_updated
FROM
site_site si
group by si.owned_by_organization_id
),
user_org_ids as (
SELECT owned_by_organization_id as org_id
from users_count
where users_count.user_date_updated > (select max(date_updated) from {{ this }})
),
sites_org_ids as (
SELECT
owned_by_organization_id as org_id
FROM
sites_count
WHERE
sites_count.site_date_updated > (select max(date_updated) from {{ this }})
),
org_final as (
select
org.*,
users_count.*,
sites_count.*
from user_organization org
left join users_count on users_count.owned_by_organization_id = org.id
left join sites_count on sites_count.owned_by_organization_id = org.id
{% if is_incremental() %}
where org.id in (
select org_id from user_org_ids
union
select org_id from sites_org_ids
)
{% endif %}
)
select count(*) from org_final
hi , How did you managed this case , i am still struggling to get this use case work , If you have cracked it please let us know what is the best solution