Implementing a Persistent Staging Area with a dynamic incremental model


#1

Hi all, we are building a DWH with Data Vault 2.0.

To gain agility, and to be able to reload our raw data vault on demand throughout the project, we use a persistent staging area:

  • each source’s data is copied to its own schema on a schedule,
  • each source table has a corresponding incremental model in a staging schema, with the same columns plus a timestamp and a hash of the non-PK columns for change detection purposes.
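For illustration, a PSA staging table for a hypothetical `customer` source might look like this (all names here are invented, not from our actual project):

```sql
-- Hypothetical PSA table: all source columns, plus a load timestamp
-- and a hash of the non-PK columns for change detection.
create table psa.customer (
    customer_id   integer,      -- source PK
    name          text,
    city          text,
    load_dts      timestamp,    -- part of the PSA PK
    hash_diff     uuid,         -- md5 of the non-PK columns
    primary key (customer_id, load_dts)
);
```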

We succeeded in implementing a dynamic incremental model in Jinja. This model takes care of added or dropped source columns, modifying the model accordingly:

  • new columns are added at the end of the model’s definition,
  • dropped columns are kept for history.

The model makes heavy use of the adapter’s functions to detect definition differences and to execute the corresponding DDL statements in the pre/post hooks.
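A minimal sketch of that column-diff step, assuming the adapter exposes `get_columns_in_relation` (older dbt versions use `get_columns_in_table`); the macro name and details are illustrative, not our exact code:

```sql
{% macro add_missing_columns(source_relation) %}
    {# Compare the source and target definitions, and add any new
       source columns at the end of the target table. #}
    {%- set source_columns = adapter.get_columns_in_relation(source_relation) -%}
    {%- set target_columns = adapter.get_columns_in_relation(this) -%}
    {%- set target_names = target_columns | map(attribute='name') | list -%}
    {%- for col in source_columns -%}
        {%- if col.name not in target_names %}
            alter table {{ this }} add column {{ col.name }} {{ col.data_type }};
        {%- endif -%}
    {%- endfor -%}
{% endmacro %}
```

Dropped columns require no DDL at all: they simply stay in the target table and are filled with null by the generated select list.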

Data Vault entity incremental models can then be rebuilt on demand on top of these staging models, respecting the timestamps for auditability reasons.

Has anyone already used dbt to build a Data Vault EDWH? We would be glad to share experiences and advice.

Best regards,
Fabrice Etanchaud


#2

Hi,

sounds interesting. I’m a newcomer to dbt, but I have some 20 years of BI/DWH experience, including DV.

I’m currently investigating using dbt to take data from the staging/landing zone to the raw vault, on to the business vault, and finally to the data marts/stage-out. I have also considered dbt_archive for the PSA, but for now I’m more interested in trying to build custom materializations that make the different loading patterns easy to implement (specify BKs in the model config; specify metadata columns like load date and load source in the project file; auto-historize; compute the change-diff hash; etc.).

I believe dbt might be the perfect tool, as it keeps the model and the ETL/ELT in one place, version controlled. The ability to use packages for the different layers, plus the dependency handling for the business vault, makes this a very interesting idea.

I’d be interested to hear/see what you’ve done.

Fridthjof-G


#3

Hi @feriksen,

I agree with you, this really deserves dedicated materializations.
For now, I have just used Jinja macros.

Here are the building blocks (following the DV 2.0 recommendations found in Dan Linstedt’s book):

  • hash_key macro
  • hash_diff macro (computed on the left-most non null attributes, to take care of column additions)
  • persistent staging area model

This model is built by a single macro call, introspecting the source and model definitions in INFORMATION_SCHEMA.
Columns: all source table columns + load_dts + hash_diff.
PK: source table PK + load_dts.
It is an incremental, insert-only model, based on a left join between the source table and {{ this }}.

The pre_hook compares {{ this }} with the source table definition and adds any new columns.

The post_hook creates the PK in full-refresh mode.

The model file’s content is a single macro call, passing the source table name and, if needed, an alias.
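As a sketch (PostgreSQL-flavoured, with an invented `customer` source whose PK is `customer_id`), the SQL generated for such a staging model might look roughly like this:

```sql
select
    src.customer_id,
    src.name,
    src.city,
    current_timestamp as load_dts,
    {{ hash_diff(['name', 'city']) }} as hash_diff
from source_schema.customer as src
{% if is_incremental() %}
left join (
    -- latest stored version of each source row
    select distinct on (customer_id) customer_id, hash_diff
    from {{ this }}
    order by customer_id, load_dts desc
) as cur on cur.customer_id = src.customer_id
-- insert-only: keep new rows and rows whose hash_diff changed
where cur.customer_id is null
   or cur.hash_diff <> {{ hash_diff(['name', 'city']) }}
{% endif %}
```

The hash_diff expression is repeated in the where clause because the select-list alias cannot be referenced there.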

The model has the following preconditions:

  • no PK change in the source table (IMHO such a change should trigger a new model definition),
  • no incompatible column type change (though text columns can lengthen).

Now I am trying to implement a satellite model with full-refresh capability.
I think I can use the lead or lag window function on hash_diff to detect true satellite deltas on top of the staging deltas.
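The lag-based delta detection could look like this (a sketch only; `staging.customer` and `customer_hash_key` are invented names):

```sql
-- Keep only the rows whose hash_diff differs from the previous
-- version of the same key: true satellite deltas on top of the
-- staging deltas.
select *
from (
    select
        s.*,
        lag(hash_diff) over (
            partition by customer_hash_key
            order by load_dts
        ) as previous_hash_diff
    from staging.customer as s
) as deltas
where previous_hash_diff is distinct from hash_diff;
```

`is distinct from` also keeps the first version of each key, where lag() returns null.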

I would be very glad to have your opinion on this, considering your skills in DV :slight_smile:

Last, I am wondering if a materialization is the right way… I think I am going to give Jinja block inheritance a try (if dbt handles it).

Best regards,
Fabrice Etanchaud


#4

It seems dbt does not handle template inheritance for the time being.


#5

Hi, sorry for late reply - busy days :slight_smile:

Yes, I do believe custom materializations are the way to go: the actual loading pattern is fixed, so with a minimum of configuration (business keys, columns selected for the hash_diff of satellites, etc.), together with the dependency tracking built into dbt, I think we have a winner.

At the end of the day, materializations are "just" a set of macros too, when you think about it.

I got side-tracked this last week building a package for the metrics mart (dv-metrics, close to version 1.0; I think another week and it should be ready, at least for the raw vault. I still need to work a bit on how to handle bridges, though PIT tables are easy), but once that is done I’ll continue on my materialization package.

When you say PK, are we talking about physical/actual PKs, or do you pass the column(s) as part of the config?

the way I see it is kind of like this:

you have a landing zone/extract table like this (pseudo example):

create table stage_.customer_list
(
    customer_number varchar(128),
    country_code    varchar(128),
    city            varchar(128),
    load_dts        timestamp,
    load_src        varchar(128)
)

from this we build base models (base__customer_list), materialized as views:

select
customer_number,
country_code,
city,
load_dts,
load_src
from stage_.customer_list

on top of this we can now build:

hubs:

h_customer.sql

{{ config(materialized='hub', business_key='customer_number') }}

select * from {{ ref('base__customer_list') }}

h_country.sql

{{ config(materialized='hub', business_key='country_code') }}

select * from {{ ref('base__customer_list') }}

links:

l_customer_country.sql

{{ config(materialized='link', historization='true', hubs=['h_customer', 'h_country']) }}

select X
from {{ ref('base__customer_list') }}

satellites:

s_customer.sql

{{ config(materialized='satellite', historization='true', parent='h_customer') }}

select
    {{ business_key_hash }},
    city,
    {{ vault_meta_columns }}
from {{ ref('base__customer_list') }}
Having dived through the existing materializations and traced what dbt does, there is no question in my mind that this is the way to go. The difficult part is two-fold: figuring out a sane level of config/convention/customization, and ensuring cross-platform support (I focus on Postgres for now).

Again, I’d love to see what you’ve done so far, if you’re able to share the code somewhere I’d be happy to look at it, test it out, and give you some feedback.

once my dv-metrics is done, I’ll let you know, and I’d very much like to get your feedback as well :slight_smile:

Regards,

Fridthjof-G


#6

Hey @fabrice.etanchaud - I just replied to your comment over here before reading this post! I’d 100% agree that materializations are the way to go!

This is a super interesting conversation! I’m going to follow along here, but I’m afraid I don’t have too much prior knowledge on data vaults. Let me know if you have any dbt-specific questions – happy to help however I can!


#7

Hi @feriksen,

Yes, the model reads the source table’s PK from the metadata and creates the staging table’s PK as ‘source PK columns + load_dts’.

IMHO it would be best to compare notes on the PSA solution once it is rewritten in materialization form.
But below you can find the macros for the hash_diff and hash_key functions.

They follow the DV 2.0 recommendations:

  • ‘;’ as the separator,
  • each value is trimmed, upper-cased, and cast to text,
  • null is replaced by the empty string,
  • in order to correctly detect deltas when source table columns are added, only the columns’ values up to the last non-null value are taken into account (a;1;null;b;null;null yields a;1;null;b),
  • not shown here, but dropped source columns are kept in the model’s table and replaced by null in the select column list.

Last, due to a PostgreSQL limitation (no more than 100 arguments in a function call), I had to batch the coalesce and concat function calls.

Looking forward,

Fabrice Etanchaud

{%- macro batch_function_call(function_name, params, batch_size) -%}
    {#- Works around PostgreSQL's limit of 100 arguments per function
        call by nesting: f(p1..p200) becomes f(f(p1..p100), f(p101..p200)). -#}
    {{ function_name }}(
        {%- if params | length <= batch_size -%}
            {{ params | join(', ') }}
        {%- else -%}
            {%- for params_batch in params | batch(batch_size) -%}
                {{ function_name }}(
                    {{ params_batch | join(', ') }}
                )
                {%- if not loop.last -%}, {%- endif -%}
            {%- endfor -%}
        {%- endif -%}
    )
{%- endmacro -%}

{%- macro hash_key(colnames) -%}
    {#- DV 2.0 hash key: each value is trimmed, upper-cased and cast to
        text; values are joined with a quoted ';' literal before hashing. -#}
    md5(
    {%- if colnames | length > 0 -%}
        {%- set strings = [] -%}
        {%- for colname in colnames %}
            {%- do strings.append("upper(trim(cast(" ~ colname ~ " as text)))") -%}
            {%- if not loop.last -%}
                {%- do strings.append("';'") -%}
            {%- endif -%}
        {%- endfor -%}
        {{ batch_function_call('concat', strings, 100) }}
    {%- else -%}
        ''
    {%- endif -%}
    )::uuid
{%- endmacro -%}

{%- macro hash_diff(colnames) -%}
    {#- DV 2.0 hash diff: like hash_key, but the ';' separator after a
        value is emitted only while at least one of the following columns
        is non-null, so trailing nulls (freshly added columns) do not
        change the hash of historical rows. -#}
    md5(
    {%- if colnames | length > 0 -%}
        {%- set strings = [] -%}
        {%- for colname in colnames %}
            {%- do strings.append("upper(trim(cast(" ~ colname ~ " as text)))") -%}
            {%- if not loop.last -%}
                {%- set next_strings = [] -%}
                {%- for next_string in colnames[loop.index:] -%}
                    {%- do next_strings.append("cast(" ~ next_string ~ " as text)") -%}
                {%- endfor -%}
                {%- do strings.append("case when " ~ batch_function_call('coalesce', next_strings, 100) ~ " is not null then ';' end") -%}
            {%- endif -%}
        {%- endfor -%}
        {{ batch_function_call('concat', strings, 100) }}
    {%- else -%}
        ''
    {%- endif -%}
    )::uuid
{%- endmacro %}
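To make the delta-trimming behaviour concrete: tracing the macro above, a call like {{ hash_diff(['a', 'b', 'c']) }} renders (whitespace aside) to:

```sql
-- Each ';' separator is wrapped in a case expression that yields null
-- unless some later column is non-null; concat treats null as empty.
md5(concat(
    upper(trim(cast(a as text))),
    case when coalesce(cast(b as text), cast(c as text)) is not null then ';' end,
    upper(trim(cast(b as text))),
    case when coalesce(cast(c as text)) is not null then ';' end,
    upper(trim(cast(c as text)))
))::uuid
```

So when c (or both b and c) is null, the trailing separators vanish and the hash equals the one computed before those columns existed.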

#8

Almost done: a ‘staging’ materialization based on the ‘incremental’ one!

The current config parameters are:

  • source_schema
  • source_table
  • schema/alias
  • excluded columns (from source table)

There seems to be only one problem: unlike the ‘archive’ materialization, one still has to create a SQL model file containing only the configuration.

Thank you all for your advice, and looking forward to seeing how the Data Vault entity materializations could look!