Generate dummy record based on columns in table

The problem I’m having

I try to generate a record with default values in my dimensions. Unfortunately, I can’t seem to be able to do this without building the dimension twice (this is required in the example below) or run to a cycle exception.

The context of why I’m trying to do this

My customer requires that every dimension has a dummy record to prevent ‘left join’ in reporting queries. Also this should make it easier for developers to add code in an automated way (less error prone).

What I’ve already tried

The macro below is what I have tried. This generates a union that selects all the required columns from the relation (in this example ‘this’ is used). All fields are default values based on a set of rules. The way I understand the keyword ‘this’ is that it looks at an already existing table. I have also used ref() to try this, but that gives me a cycle exception.

Some example code or error messages

{%- macro get_default_value(column) -%}
    {%- if column.is_string() and not (column.name|lower).endswith("_sk") -%}
        'N/A'
    {%- elif column.is_string() and (column.name|lower).endswith("_sk") -%}
        {{ dbt_utils.generate_surrogate_key(["'-1'"]) }}
    {%- elif (column.is_number() or column.is_float()) and (column.name|lower).endswith("_id") -%}
        '-1'
    {%- elif (column.is_number() or column.is_float()) and not (column.name|lower).endswith("_id") -%}
        '0'
    {%- elif (column.data_type|lower) in ["timestamp", "timestamp_ntz", "timestamp_ltz", "date"] and "a_date_i" not in (column.name|lower) and "a_date_u" not in (column.name|lower) -%}
        '1900-01-01'::timestamp_ntz
    {%- elif (column.data_type|lower) in ["timestamp", "timestamp_ntz", "timestamp_ltz", "date"] and ("a_date_i" in (column.name|lower) or "a_date_u" in (column.name|lower)) -%}
        current_timestamp
    {%- elif (column.data_type|lower) == "boolean" -%}
        FALSE
    {%- else -%}
        NULL
    {%- endif -%}
{%- endmacro -%}

{%- macro dummy_data() %}
    {%- set all_columns = adapter.get_columns_in_relation(this) -%}
    {% if all_columns %}
        union select
        {% for column in all_columns %}
            {%- if not loop.first %},{% endif -%}
            {{ get_default_value(column) }} as {{ column.name }}
        {% endfor %}
    {% endif %}
{% endmacro -%}

How it is used:

with
    current_data as (
        select aggregate_type, max(a_date_file) as a_date_file
        from {{ ref("wms_event_log") }}
        group by aggregate_type
    ),
    unit_types as (
        select distinct
            aggregate_type,
            a_date_file
        from {{ ref("wms_event_log") }}
    )


select 
    {{ dbt_utils.generate_surrogate_key(["unit_types.aggregate_type"]) }} as eut_d_sk,
    unit_types.aggregate_type as eut_event_unit_type,
    current_timestamp as eut_a_date_i,
    current_timestamp as eut_a_date_u,
    unit_types.a_date_file as eut_change_date
from unit_types
inner join current_data
    on unit_types.aggregate_type = current_data.aggregate_type
        and unit_types.a_date_file = current_data.a_date_file
{% if is_incremental() %}
  where {{ batch_date_validation(["eut_change_date"]) }}
{% endif %}
{{ dummy_data() }}

What if you run this macro in a post-hook? You would have to make some changes like insert instead of union, but it might work

That’s what I found out as an option after reading another article as well. But as a newby I didn’t think of that straight away. For future reference, you can find my code below.

{%- macro get_default_value(column) -%}
    {%- if column.is_string() and not (column.name|lower).endswith("_sk") and not (column.name|lower).endswith("_id") -%}
        '{{ "N/A" if column.string_size() > 2 else "N" }}'
    {%- elif column.is_string() and (column.name|lower).endswith("_sk") -%}
        {{ dbt_utils.generate_surrogate_key(["'-1'"]) }}
    {%- elif (column.name|lower).endswith("_id") -%}
        '-1'
    {%- elif (column.is_number() or column.is_float()) and not (column.name|lower).endswith("_id") -%}
        '0'
    {%- elif (column.data_type|lower) in ["timestamp", "timestamp_ntz", "timestamp_ltz", "date"] and "a_date_i" not in (column.name|lower) and "a_date_u" not in (column.name|lower) -%}
        '1000-01-01'::timestamp_ntz
    {%- elif (column.data_type|lower) in ["timestamp", "timestamp_ntz", "timestamp_ltz", "date"] and ("a_date_i" in (column.name|lower) or "a_date_u" in (column.name|lower)) -%}
        current_timestamp
    {%- elif (column.data_type|lower) == "boolean" -%}
        FALSE
    {%- else -%}
        NULL
    {%- endif -%}
{%- endmacro -%}

{%- macro create_dummy_data(skey) %}
{% if not is_incremental() %}
    {%- set all_columns = adapter.get_columns_in_relation(this) -%}
    {% if all_columns %}
        merge into {{ this }} tgt 
        using (select
        {% for column in all_columns %}
            {%- if not loop.first %},{% endif -%}
            {{ get_default_value(column) }} as {{ column.name }}
        {% endfor %}) src
        on tgt.{{ skey }} = src.{{ skey }}
        when not matched then
        insert ({% for column in all_columns %}
            {%- if not loop.first %},{% endif -%}
            {{ column.name }}
        {% endfor %})
        values ({% for column in all_columns %}
            {%- if not loop.first %},{% endif -%}
            src.{{ column.name }}
        {% endfor %})
    {% endif %}
{% endif %}
{% endmacro -%}

How it is implemented:

{{ config(post_hook="{{ create_dummy_data('eut_d_sk') }}") }}

with
    current_data as (
        select aggregate_type, max(a_date_file) as a_date_file
        from {{ ref("wms_event_log") }}
        group by aggregate_type
    ),
    unit_types as (
        select distinct
            aggregate_type,
            a_date_file
        from {{ ref("wms_event_log") }}
    )


select 
    {{ dbt_utils.generate_surrogate_key(["unit_types.aggregate_type"]) }} as eut_d_sk,
    unit_types.aggregate_type as eut_event_unit_type,
    current_timestamp as eut_a_date_i,
    current_timestamp as eut_a_date_u,
    unit_types.a_date_file as eut_change_date
from unit_types
inner join current_data
    on unit_types.aggregate_type = current_data.aggregate_type
        and unit_types.a_date_file = current_data.a_date_file
{% if is_incremental() %}
  where {{ batch_date_validation(["eut_change_date"]) }}
{% endif %}
1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.