How to feed DWH models from multiple async sources

Hi all,
I’d like to use dbt to build a DWH with multiple source systems (one for each company in the group), which I need to load on different schedules during the day.
The idea is to land in a common DWH data model, so that e.g. my INVOICE table will contain data coming from all the different source systems (each one with its own data structure, and thus its own transformations).

I know that in dbt I cannot have more than one model pointing to the same physical table, so I thought of two possible approaches:

  1. create one dbt project for each source system (which is not my preferred choice)
  2. create an intermediate model for each source system, having the same data structure as the target, and then load the final DWH table with something like
    select * from {{ ref( var('invoice') ) }}
    giving a different value to var('invoice') for each run, depending on the source system I’m loading (see the sketch below)
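
For illustration, here is a minimal sketch of approach 2. All model, source, and column names (stg_company_a__invoice, etc.) are placeholders, and the incremental config assumes an adapter that supports the delete+insert strategy, so that each run only replaces the rows of the source system being loaded:

```sql
-- models/staging/stg_company_a__invoice.sql
-- one intermediate model per source system, all exposing the common structure
select
    'company_a' as source_system,
    inv_number  as invoice_number,
    inv_date    as invoice_date,
    inv_amount  as invoice_amount
from {{ source('company_a', 'invoices') }}
```

```sql
-- models/marts/invoice.sql
-- the intermediate model to load from is chosen at run time via --vars
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='source_system'
) }}

select *
from {{ ref(var('invoice', 'stg_company_a__invoice')) }}
```

Each source system then gets its own scheduled invocation:

```
dbt run --select invoice --vars '{invoice: stg_company_a__invoice}'
dbt run --select invoice --vars '{invoice: stg_company_b__invoice}'
```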

Did any of you have the same problem in the past?
How did you address it?
Do you see any drawback in approach 2?
The main drawback I see is that the documentation won’t be complete, i.e. the DAG will only point to the value of var('invoice') that was set when I generated the documentation.

Thanks
Daniele

Hi @daniele.frigo, did you have a look at the standard DWH model (staging/warehouse/marts)?
I don’t understand why you would have to change your source at each dbt run.
Why not union all your source invoice tables in a final invoice model?
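
For example, if each company already has a staging model exposing the common structure (the names below are placeholders), the final model could simply be:

```sql
-- models/marts/invoice.sql
select * from {{ ref('stg_company_a__invoice') }}
union all
select * from {{ ref('stg_company_b__invoice') }}
union all
select * from {{ ref('stg_company_c__invoice') }}
```

The dbt_utils.union_relations macro can also generate this kind of union for you.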

A proven approach to safely combining many sources is Data Vault.

Hoping it helps,
Best regards,

@fabrice.etanchaud I know perfectly well how a standard DWH works.
As I tried to explain, I need to feed a common data model from different source systems (one for each company), but since they are in different places of the world, I need to schedule each of them at different times of the day.
I’d like to avoid reloading all the companies’ data every time I need to refresh one of them.
I could add some kind of run id and filter on it when I do the union, but on some databases that might not be the most efficient loading strategy anyway.
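
To make that idea concrete, one possible sketch is an incremental model that unions all the staging models but, on each run, only rebuilds the slice of the company being loaded, selected through a variable. The names (source_system, company, the staging models) are placeholders, and it assumes an adapter that supports the delete+insert incremental strategy:

```sql
-- models/marts/invoice.sql
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='source_system'
) }}

with unioned as (
    select * from {{ ref('stg_company_a__invoice') }}
    union all
    select * from {{ ref('stg_company_b__invoice') }}
)

select *
from unioned
{% if is_incremental() %}
  -- only the company passed via --vars is deleted and re-inserted
  where source_system = '{{ var("company") }}'
{% endif %}
```

```
dbt run --select invoice --vars '{company: company_a}'
```

This keeps the DAG and the docs complete, since every staging model is still a real ref, but whether filtering the union like this is efficient enough does depend on the database.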