Implementing a Persistent Staging Area with a dynamic incremental model


#1

Hi all, we are building a DWH with Data Vault 2.0.

To gain agility, and to be able to reload our raw data vault on demand throughout the project, we use a persistent staging area:

  • each source’s data is copied to its own schema on a schedule,
  • each source table has a corresponding incremental model in a staging schema, with the same columns plus a timestamp and a hash of the non-PK columns for change detection purposes.
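For illustration, a PSA staging table for a hypothetical `customer` source might look like this (all names here are invented, not from our actual project):

```sql
-- Hypothetical PSA table: all source columns, plus a load timestamp
-- and a hash of the non-PK columns for change detection.
create table psa.customer (
    customer_id   integer,      -- source PK
    name          text,
    city          text,
    load_dts      timestamp,    -- part of the PSA PK
    hash_diff     uuid,         -- md5 of the non-PK columns
    primary key (customer_id, load_dts)
);
```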

We succeeded in implementing a dynamic incremental model in Jinja. This model takes care of added or dropped source columns, modifying the model accordingly:

  • new columns are added at the end of the model’s definition,
  • dropped columns are kept for history.

The model makes heavy use of the adapter’s functions to detect definition differences and to execute the corresponding DDL statements in the pre/post hooks.
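A minimal sketch of that column-diff step, assuming the adapter exposes `get_columns_in_relation` (older dbt versions use `get_columns_in_table`); the macro name and details are illustrative, not our exact code:

```sql
{% macro add_missing_columns(source_relation) %}
    {# Compare the source and target definitions, and add any new
       source columns at the end of the target table. #}
    {%- set source_columns = adapter.get_columns_in_relation(source_relation) -%}
    {%- set target_columns = adapter.get_columns_in_relation(this) -%}
    {%- set target_names = target_columns | map(attribute='name') | list -%}
    {%- for col in source_columns -%}
        {%- if col.name not in target_names %}
            alter table {{ this }} add column {{ col.name }} {{ col.data_type }};
        {%- endif -%}
    {%- endfor -%}
{% endmacro %}
```

Dropped columns require no DDL at all: they simply stay in the target table and are filled with null by the generated select list.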

Data Vault entity incremental models can then be rebuilt on demand on top of these staging models, respecting the timestamps for auditability reasons.

Has anyone already used dbt to build a Data Vault EDWH? We would be glad to share experiences and advice.

Best regards,
Fabrice Etanchaud


#2

Hi,

sounds interesting. I’m a newcomer to dbt, but I have some 20 years of BI/DWH experience, including DV.

I’m currently investigating using dbt to take data from the staging/landing zone to the raw vault, on to the business vault, and finally to the data marts/stage-out. I have also considered dbt_archive for the PSA, but for now I’m more interested in trying to build custom materializations that make the different loading patterns easy to implement (specify BKs in the model config; specify metadata columns like load date and load source in the project file; auto-historize; compute the change-diff hash; etc.).

I believe dbt might be the perfect tool, as it keeps the model and the ETL/ELT in one place, version controlled. The ability to use packages for the different layers, plus the dependency handling for the business vault, makes this a very interesting idea.

I’d be interested to hear/see what you’ve done.

Fridthjof-G


#3

Hi @feriksen,

I agree with you, this really deserves dedicated materializations.
For now, I have just used Jinja macros.

Here are the building blocks (following the DV 2.0 recommendations found in Dan Linstedt’s book):

  • hash_key macro
  • hash_diff macro (computed on the left-most non null attributes, to take care of column additions)
  • persistent staging area model

This model is built by a single macro call, introspecting the source and model definitions in INFORMATION_SCHEMA.
Columns: all source table columns + load_dts + hash_diff.
PK: source table PK + load_dts.
It is an incremental, insert-only model, based on a left join between the source table and {{ this }}.

The pre_hook compares {{ this }} with the source table definition and adds any new columns.

The post_hook creates the PK in full-refresh mode.

The model file’s content is a single macro call, passing the source table name and, if needed, an alias.
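As a sketch (PostgreSQL-flavoured, with an invented `customer` source whose PK is `customer_id`), the SQL generated for such a staging model might look roughly like this:

```sql
select
    src.customer_id,
    src.name,
    src.city,
    current_timestamp as load_dts,
    {{ hash_diff(['name', 'city']) }} as hash_diff
from source_schema.customer as src
{% if is_incremental() %}
left join (
    -- latest stored version of each source row
    select distinct on (customer_id) customer_id, hash_diff
    from {{ this }}
    order by customer_id, load_dts desc
) as cur on cur.customer_id = src.customer_id
-- insert-only: keep new rows and rows whose hash_diff changed
where cur.customer_id is null
   or cur.hash_diff <> {{ hash_diff(['name', 'city']) }}
{% endif %}
```

The hash_diff expression is repeated in the where clause because the select-list alias cannot be referenced there.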

The model has the following preconditions:

  • no PK change in the source table (IMHO such a change should trigger a new model definition),
  • no incompatible column type change (though text columns can lengthen).

Now I am trying to implement a satellite model with full-refresh capability.
I think I can use the lead or lag window function on hash_diff to detect true satellite deltas on top of the staging deltas.
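The lag-based delta detection could look like this (a sketch only; `staging.customer` and `customer_hash_key` are invented names):

```sql
-- Keep only the rows whose hash_diff differs from the previous
-- version of the same key: true satellite deltas on top of the
-- staging deltas.
select *
from (
    select
        s.*,
        lag(hash_diff) over (
            partition by customer_hash_key
            order by load_dts
        ) as previous_hash_diff
    from staging.customer as s
) as deltas
where previous_hash_diff is distinct from hash_diff;
```

`is distinct from` also keeps the first version of each key, where lag() returns null.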

I would be very glad to have your opinion on this, considering your skills in DV :slight_smile:

Last, I am wondering if a materialization is the right way… I think I am going to give Jinja block inheritance a try (if dbt handles it).

Best regards,
Fabrice Etanchaud


#4

It seems dbt does not handle template inheritance for the time being.


#5

Hi, sorry for late reply - busy days :slight_smile:

Yes, I do believe custom materializations are the way to go: the actual loading pattern is fixed, so with a minimum of configuration (business keys, columns selected for the hash_diff of satellites, etc.), together with the dependency tracking built into dbt, I think we have a winner.

At the end of the day, materializations are "just" a set of macros too, when you think about it.

I got side-tracked this last week building a package for the metrics mart (dv-metrics, close to version 1.0; I think another week and it should be ready, at least for the raw vault. I still need to work a bit on how to handle bridges, though PIT tables are easy), but once that is done I’ll continue on my materialization package.

When you say PK, are we talking about physical/actual PKs, or do you pass the column(s) as part of the config?

the way I see it is kind of like this:

you have a landing zone/extract table like this (pseudo example):

create table stage_.customer_list
(
    customer_number varchar(128),
    country_code    varchar(128),
    city            varchar(128),
    load_dts        timestamp,
    load_src        varchar(128)
)

from this we build base models (base__customer_list), materialized as views:

select
customer_number,
country_code,
city,
load_dts,
load_src
from stage_.customer_list

on top of this we can now build:

hubs:

h_customer.sql

{{ config(materialized='hub', business_key='customer_number') }}

select * from {{ ref('base__customer_list') }}

h_country.sql

{{ config(materialized='hub', business_key='country_code') }}

select * from {{ ref('base__customer_list') }}

links:

l_customer_country.sql

{{ config(materialized='link', historization='true', hubs=['h_customer', 'h_country']) }}

select X
from {{ ref('base__customer_list') }}

satellites:

s_customer.sql

{{ config(materialized='satellite', historization='true', parent='h_customer') }}

select
    {{ business_key_hash }},
    city,
    {{ vault_meta_columns }}
from {{ ref('base__customer_list') }}
Having dived through the existing materializations and traced what dbt does, there is no question in my mind that this is the way to go. The difficult part is two-fold: figuring out a sane level of config/convention/customization, and ensuring cross-platform support (I focus on Postgres for now).

Again, I’d love to see what you’ve done so far, if you’re able to share the code somewhere I’d be happy to look at it, test it out, and give you some feedback.

once my dv-metrics is done, I’ll let you know, and I’d very much like to get your feedback as well :slight_smile:

Regards,

Fridthjof-G


#6

Hey @fabrice.etanchaud - I just replied to your comment over here before reading this post! I’d 100% agree that materializations are the way to go!

This is a super interesting conversation! I’m going to follow along here, but I’m afraid I don’t have too much prior knowledge on data vaults. Let me know if you have any dbt-specific questions – happy to help however I can!


#7

Hi @feriksen,

Yes, the model reads the source table’s PK from the metadata and creates the staging table’s PK as ‘source PK columns + load_dts’.

IMHO it would be best to compare notes on the PSA solution once it is rewritten in materialization form.
But below you can find the macros for the hash_diff and hash_key functions.

They follow the DV 2.0 recommendations:

  • ‘;’ as the separator,
  • each value is trimmed, upper-cased, and cast to text,
  • null is replaced by the empty string,
  • in order to correctly detect deltas when source table columns are added, only the columns’ values up to the last non-null value are taken into account (a;1;null;b;null;null yields a;1;null;b),
  • not shown here, but dropped source columns are kept in the model’s table and replaced by null in the select column list.

Last, due to a PostgreSQL limitation (no more than 100 arguments in a function call), I had to batch the coalesce and concat function calls.

Looking forward,

Fabrice Etanchaud

{%- macro batch_function_call(function_name, params, batch_size) -%}
    {#- Works around PostgreSQL's limit of 100 arguments per function
        call by nesting: f(p1..p200) becomes f(f(p1..p100), f(p101..p200)). -#}
    {{ function_name }}(
        {%- if params | length <= batch_size -%}
            {{ params | join(', ') }}
        {%- else -%}
            {%- for params_batch in params | batch(batch_size) -%}
                {{ function_name }}(
                    {{ params_batch | join(', ') }}
                )
                {%- if not loop.last -%}, {%- endif -%}
            {%- endfor -%}
        {%- endif -%}
    )
{%- endmacro -%}

{%- macro hash_key(colnames) -%}
    {#- DV 2.0 hash key: each value is trimmed, upper-cased and cast to
        text; values are joined with a quoted ';' literal before hashing. -#}
    md5(
    {%- if colnames | length > 0 -%}
        {%- set strings = [] -%}
        {%- for colname in colnames %}
            {%- do strings.append("upper(trim(cast(" ~ colname ~ " as text)))") -%}
            {%- if not loop.last -%}
                {%- do strings.append("';'") -%}
            {%- endif -%}
        {%- endfor -%}
        {{ batch_function_call('concat', strings, 100) }}
    {%- else -%}
        ''
    {%- endif -%}
    )::uuid
{%- endmacro -%}

{%- macro hash_diff(colnames) -%}
    {#- DV 2.0 hash diff: like hash_key, but the ';' separator after a
        value is emitted only while at least one of the following columns
        is non-null, so trailing nulls (freshly added columns) do not
        change the hash of historical rows. -#}
    md5(
    {%- if colnames | length > 0 -%}
        {%- set strings = [] -%}
        {%- for colname in colnames %}
            {%- do strings.append("upper(trim(cast(" ~ colname ~ " as text)))") -%}
            {%- if not loop.last -%}
                {%- set next_strings = [] -%}
                {%- for next_string in colnames[loop.index:] -%}
                    {%- do next_strings.append("cast(" ~ next_string ~ " as text)") -%}
                {%- endfor -%}
                {%- do strings.append("case when " ~ batch_function_call('coalesce', next_strings, 100) ~ " is not null then ';' end") -%}
            {%- endif -%}
        {%- endfor -%}
        {{ batch_function_call('concat', strings, 100) }}
    {%- else -%}
        ''
    {%- endif -%}
    )::uuid
{%- endmacro %}
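To make the delta-trimming behaviour concrete: tracing the macro above, a call like {{ hash_diff(['a', 'b', 'c']) }} renders (whitespace aside) to:

```sql
-- Each ';' separator is wrapped in a case expression that yields null
-- unless some later column is non-null; concat treats null as empty.
md5(concat(
    upper(trim(cast(a as text))),
    case when coalesce(cast(b as text), cast(c as text)) is not null then ';' end,
    upper(trim(cast(b as text))),
    case when coalesce(cast(c as text)) is not null then ';' end,
    upper(trim(cast(c as text)))
))::uuid
```

So when c (or both b and c) is null, the trailing separators vanish and the hash equals the one computed before those columns existed.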

#8

Almost done: a ‘staging’ materialization based on the ‘incremental’ one!

The current config parameters are:

  • source_schema
  • source_table
  • schema/alias
  • excluded columns (from source table)

There seems to be only one problem: unlike the ‘archive’ materialization, one still has to create a SQL model file containing only the configuration.

Thank you all for your advice, and looking forward to seeing how the Data Vault entity materializations could look!