Use a UDF to do a dynamic rollup by date

I know this post is super-dead, but I am at wits’ end and this is the only discussion of using macros to create UDFs that I’ve found.

I need to do a dynamic rollup by date, and I don’t want to create a model that precomputes the rollups for all possible dates in a standard dbt model. I want to pass a date to a UDF and get back the corresponding rollup. I dug around and found this thread, and I wrote the CREATE OR REPLACE FUNCTION statement as a macro with three {{ref()}} calls to the necessary tables.
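In other words, production code should be able to do something like this (names illustrative):

select * from analytics.udf_rollup_as_of('2023-01-31'::date);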

This seems simple, but I can’t get it to run either as a run-operation or as an on-run-end hook. In both cases, {{ref(name_of_table)}} fails to resolve.

With run-operation:

18:15:38  Encountered an error while running operation: Compilation Error
  argument of type 'NoneType' is not iterable

  > in macro ref (macros/patch.sql)
  > called by macro name_of_macro (macros/udfs/macro_file.sql)
  > called by <Unknown>

With on-run-end:

on-run-end failed, error:
 dbt was unable to infer all dependencies for the model "xyz-on-run-end-0".
This typically happens when ref() is placed within a conditional block.
To fix this, add the following hint to the top of the model "xyz-on-run-end-0":

-- depends_on: {{ ref('name_of_table') }}

Also worth noting: even for a VERY simple command, things don’t behave as I’d expect. The below fails silently with run-operation, while the database logs WARNING: there is already a transaction in progress:

{% macro name_of_macro() %}

DROP FUNCTION IF EXISTS {{ target.schema }}.name_of_udf;

{% endmacro %}

Once I wrapped the statement in {% do run_query(...) %}, it worked properly. Unfortunately, the CREATE OR REPLACE FUNCTION macro doesn’t work in any configuration I’ve been able to find.
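For reference, a sketch of the run_query version that did work (the UDF name is still a placeholder):

{% macro name_of_macro() %}
  {# run_query executes the rendered SQL immediately, rather than just returning it #}
  {% do run_query("DROP FUNCTION IF EXISTS " ~ target.schema ~ ".name_of_udf;") %}
{% endmacro %}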

I’ve wrapped the whole thing in run_query, and I’ve added `-- depends_on:` comments everywhere I can imagine they might help, but nothing works. I can’t find any relevant documentation, so here I am, posting. Thanks for reading.

Just in case someone else has the same issue and finds this, I actually solved it right after posting… Had to put the depends_on comments in my dbt_project.yml file, right above on-run-end, like so:

## depends_on: {{ ref('table_2') }}
## depends_on: {{ ref('table_3') }}
on-run-end:
  - '{{create_udfs()}}'

I was completely wrong. Adding those comments caused the model to succeed on the next run only because editing dbt_project.yml makes dbt re-parse the project, and for whatever reason it gets it right the second time. rm -r target/ && dbt parse reverts state back to broken until you edit dbt_project.yml again to trigger whatever this alternate parsing is.

This has to be a bug. I’ll file a report. In the meantime, dbt is just unable to support my use-case, and after many hours of trying (the UDF took about 20 minutes to write), I’m going to have to do something different.

Hey @claires, sorry you’re having trouble! I’m not an expert with UDFs or run-operation so not sure how far I’ll get, but happy to help you try.

Can you post the content of the UDF you’re trying to run? Where are you aiming to invoke it from? I understand the use case of customised rollups, but don’t know where/how you’re planning to trigger it.

A couple of potential ideas (one short-term, one long-term). I’m throwing a lot of stuff at the wall here, so let me know if I’m off base.

Short term: put the macro in multiple model files

Standard practice when you need to create multiple tables which have similar functionality but different configurations would be to create a macro and put that macro inside otherwise-empty model files. This means that each node gets correctly represented in the DAG (if you create a table in a UDF, it won’t appear anywhere).

This would look something like this:

-- macros/rollup_data.sql
{% macro rollup_data(datepart) %}
  select
    date_trunc('{{ datepart }}', created_at) as date_{{ datepart }}, -- created_at stands in for your timestamp column
    count(*) as row_count
  from {{ ref('something') }}
  group by 1
{% endmacro %}

--models/rollup_by_day.sql
-- depends_on: {{ ref('something') }}
{{ rollup_data('day') }}

--models/rollup_by_week.sql
-- depends_on: {{ ref('something') }}
{{ rollup_data('week') }}

:point_up: an alternative approach: if you passed the model refs into the macro and used those arguments in the macro code, you wouldn’t need the depends_on comments. Arguably uglier, but has the benefit of not having to hardcode the dependency :man_shrugging:
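That would look something like this (a sketch, reusing the hypothetical something model from above):

-- macros/rollup_data.sql
{% macro rollup_data(datepart, relation) %}
  select
    date_trunc('{{ datepart }}', created_at) as date_{{ datepart }}, -- created_at is a stand-in column
    count(*) as row_count
  from {{ relation }}
  group by 1
{% endmacro %}

--models/rollup_by_day.sql
{{ rollup_data('day', ref('something')) }}

Because ref() now appears literally in the model file, dbt picks up the dependency at parse time without any hints.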

Longer term: are you trying to build a metrics store?

You might want to look into the Semantic Layer and metrics, which would enable you to create arbitrary rollups at execution time instead of having to pre-build all the elements of an OLAP cube.
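For a flavour of what that looks like (a hypothetical metric definition; all names are illustrative and the spec may change between versions):

# models/metrics.yml
metrics:
  - name: total_amount
    label: Total amount
    model: ref('something')
    calculation_method: sum
    expression: amount
    timestamp: created_at
    time_grains: [day, week, month, year]

Consumers can then request whichever grain they need at query time instead of you materialising every rollup up front.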


Hi Joel -

I really appreciate your quick reply!

I’ll work on obfuscating our business logic into a simpler example so I can share it. In the meantime, I don’t think that splitting the macro into separate models will work - the idea is to call the UDF directly from production and get the appropriate rollup at runtime (it’s financial data that we want to be able to roll up as of today or any other date in the past, rather than rolling up at different granularities).

But metrics… this does seem like something metrics might help with, and we have some use-cases for metrics elsewhere in our stack. I’m going to dig in on that today; not sure why it didn’t occur to me already! Thank you!


Apologies for reposting; the original contained too much information about our schema.

Here are our two macros, heavily simplified but including the date logic we need. One UDF just overloads the other.

{% macro udf_stocks_and_valuations_as_of_date() %}

CREATE OR REPLACE FUNCTION {{ target.schema }}.udf_stocks_and_valuations_as_of(as_of DATE) RETURNS TABLE (
    company_id BIGINT,
    vested_shares BIGINT,
    total_common_shares BIGINT
) AS
$$ WITH


stock_certs AS (
    SELECT * FROM {{ ref('stocks') }}
),

cap_tables AS (
    SELECT * FROM {{ ref('cap_tables') }}
),

companies AS (
    SELECT * FROM {{ ref('companies') }}
),

stock_calcs AS (
    SELECT
        *,
        CASE
            WHEN vesting_period = 0
                THEN num_shares
            WHEN
                EXTRACT(YEAR FROM AGE(as_of, issued_on)) * 12 + EXTRACT(MONTH FROM AGE(as_of, issued_on)) < 12
                THEN 0
            ELSE
                (num_shares / (12 * vesting_period)) *
                (EXTRACT(YEAR FROM AGE(as_of, issued_on)) * 12 + EXTRACT(MONTH FROM AGE(as_of, issued_on)))
        END AS vested_shares


    FROM stock_certs
),

final AS (
    SELECT
        companies.company_id,
        SUM(CASE
            WHEN stock_calcs.share_class = 'common'
                THEN stock_calcs.vested_shares
        END) AS vested_shares,
        SUM(CASE
            WHEN stock_calcs.share_class = 'common'
                THEN stock_calcs.num_shares
        END) AS total_common_shares


    FROM stock_calcs
    INNER JOIN cap_tables ON cap_tables.cap_table_id = stock_calcs.cap_table_id
    INNER JOIN companies ON cap_tables.company_id = companies.company_id
    GROUP BY companies.company_id
)

SELECT * FROM final
$$ LANGUAGE SQL;

{% endmacro %}

{% macro udf_stocks_and_valuations_as_of_month() %}

CREATE OR REPLACE FUNCTION {{ target.schema }}.udf_stocks_and_valuations_as_of(yyyy INT, mm INT) RETURNS TABLE (
    company_id BIGINT,
    vested_shares BIGINT,
    total_common_shares BIGINT
) AS $$
    SELECT *
    FROM {{ target.schema }}.udf_stocks_and_valuations_as_of((make_date(yyyy, mm, 1) + INTERVAL '1 month' - INTERVAL '1 day')::DATE)
$$ LANGUAGE SQL;
{% endmacro %}

I have an additional macro with a conditional, based on one of the comments in the original post. I can’t just call the UDF creation macros outright, because they’ll run on a dbt seed, when the referenced tables don’t exist.

-- idempotently create all Postgres UDFs
{% macro create_udfs() %}
  -- Only run this if the outermost dbt command is "run"
  -- (allows us to use on-run-end rather than manual run-operations on container spinup)
  {% if flags.WHICH == "run" %}
    {{ udf_stocks_and_valuations_as_of_date() }}
    {{ udf_stocks_and_valuations_as_of_month() }}
  {% else %}
    select 1 as id
  {% endif %}
{% endmacro %}

Then we have an on-run-end hook in dbt_project.yml.

on-run-end:
  - '{{ create_udfs() }}'

I think this bug can be recreated just by:

  1. Creating a macro that calls {{ref()}}
  2. Calling that macro from another macro within a conditional
  3. Calling the outer macro as on-run-end (or, likely, on-run-start)
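A minimal reproduction might look like this (a sketch with placeholder names):

-- macros/repro.sql
{% macro inner_macro() %}
  select * from {{ ref('some_model') }}
{% endmacro %}

{% macro outer_macro() %}
  {% if flags.WHICH == "run" %}
    {{ inner_macro() }}
  {% else %}
    select 1 as id
  {% endif %}
{% endmacro %}

with the hook in dbt_project.yml:

on-run-end:
  - '{{ outer_macro() }}'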

Thanks for sharing that extra context! Super helpful.

From looking at your query, I would consider skipping the UDF altogether and instead using a date spine pattern.

This would look something like this:

--models/date_spine.sql

/*
Use whatever range makes sense for your business.
I did 100 years when I set this up at my last job and 
filtered appropriately at query time
*/

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2000-01-01' as date)",
    end_date="cast('2099-12-31' as date)"
   )
}}

Then in your model, you fan this out day by day with an inequality join. Note that this also means you can use dbt’s datediff abstraction instead of having to do the Postgres month math yourself (I’ve left the original code commented out):

--models/stocks_and_valuations_as_of_date.sql

WITH
date_spine as (
    SELECT * FROM {{ ref('date_spine') }}
),

stock_certs AS (
    SELECT * FROM {{ ref('stocks') }}
),

cap_tables AS (
    SELECT * FROM {{ ref('cap_tables') }}
),

companies AS (
    SELECT * FROM {{ ref('companies') }}
),

stock_calcs AS (
    SELECT
        date_spine.date_day,
        stock_certs.*,
        CASE
            WHEN vesting_period = 0
                THEN num_shares
            WHEN
                {{ dbt.datediff('stock_certs.issued_on', 'date_spine.date_day', 'month') }} < 12
                --EXTRACT(YEAR FROM AGE(date_spine.date_day, issued_on)) * 12 + EXTRACT(MONTH FROM AGE(date_spine.date_day, issued_on)) < 12
                THEN 0
            ELSE
                (num_shares / (12 * vesting_period)) *
                {{ dbt.datediff('stock_certs.issued_on', 'date_spine.date_day', 'month') }}
                --(EXTRACT(YEAR FROM AGE(date_spine.date_day, issued_on)) * 12 + EXTRACT(MONTH FROM AGE(date_spine.date_day, issued_on)))
        END AS vested_shares

    FROM date_spine
    LEFT OUTER JOIN stock_certs ON date_spine.date_day >= stock_certs.issued_on
    WHERE date_spine.date_day >= '2021-12-31'
      AND date_spine.date_day < '2030-01-01' -- instead of hardcoding it, it'd be nicer to derive the furthest date where vesting actually completes
),

final AS (
    SELECT
        stock_calcs.date_day, 
        companies.company_id,
        SUM(CASE
            WHEN stock_calcs.share_class = 'common'
                THEN stock_calcs.vested_shares
        END) AS vested_shares,
        SUM(CASE
            WHEN stock_calcs.share_class = 'common'
                THEN stock_calcs.num_shares
        END) AS total_common_shares


    FROM stock_calcs
    INNER JOIN cap_tables ON cap_tables.cap_table_id = stock_calcs.cap_table_id
    INNER JOIN companies ON cap_tables.company_id = companies.company_id
    GROUP BY stock_calcs.date_day, companies.company_id
)

SELECT * FROM final

Assuming your input table looks like this:

| company_id | employee_id | issued_on | vesting_period | num_shares |
| --- | --- | --- | --- | --- |
| 123 | abc | 2022-01-01 | 4 | 480 |
| 123 | def | 2022-01-22 | 4 | 960 |
| 123 | ghi | 2023-04-12 | 0 | 480 |
| 456 | jkl | 2023-02-01 | 4 | 480 |

Then the results of the intermediate stock_calcs CTE would look like:

| date_day | company_id | employee_id | issued_on | vesting_period | num_shares | vested_shares |
| --- | --- | --- | --- | --- | --- | --- |
| 2021-12-31 | null | null | null | null | null | null |
| 2022-01-01 | 123 | abc | 2022-01-01 | 4 | 480 | 0 |
| 2022-01-22 | 123 | abc | 2022-01-01 | 4 | 480 | 0 |
| 2022-01-22 | 123 | def | 2022-01-22 | 4 | 960 | 0 |
| 2023-01-01 | 123 | abc | 2022-01-01 | 4 | 480 | 120 |
| 2023-01-01 | 123 | def | 2022-01-22 | 4 | 960 | 0 |
| 2023-02-01 | 123 | abc | 2022-01-01 | 4 | 480 | 130 |
| 2023-02-01 | 123 | def | 2022-01-22 | 4 | 960 | 120 |
| 2023-02-01 | 456 | jkl | 2023-02-01 | 4 | 480 | 0 |
| 2023-04-12 | 123 | abc | 2022-01-01 | 4 | 480 | 150 |
| 2023-04-12 | 123 | def | 2022-01-22 | 4 | 960 | 140 |
| 2023-04-12 | 123 | ghi | 2023-04-12 | 0 | 480 | 480 |
| 2023-04-12 | 456 | jkl | 2023-02-01 | 4 | 480 | 0 |

Which is pretty overwhelming and gross! But after aggregation in the final CTE, you wind up with a nice wee table with a row per company and day:

| date_day | company_id | vested_shares | total_common_shares |
| --- | --- | --- | --- |
| 2021-12-31 | null | null | null |
| 2022-01-01 | 123 | 0 | 480 |
| 2022-01-22 | 123 | 0 | 1440 |
| 2023-01-01 | 123 | 120 | 1440 |
| 2023-02-01 | 123 | 250 | 1440 |
| 2023-02-01 | 456 | 0 | 480 |
| 2023-04-12 | 123 | 770 | 1920 |
| 2023-04-12 | 456 | 0 | 480 |

From there, you can also make your second model which is just the state of play on the first of the month. Something like:

--models/stocks_and_valuations_as_of_month.sql
select *
from {{ ref('stocks_and_valuations_as_of_date') }}
where extract(day from date_day) = 1 --I hope this is right! However you get the first of the month in Postgres
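For what it’s worth, an equivalent Postgres predicate (assuming date_day is a DATE column) would be:

where date_day = date_trunc('month', date_day)::date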

And then your production code will just query the resulting table:

select * 
from analytics.stocks_and_valuations_as_of_date
where company_id = 123
and date_day = '2023-04-01'

This was very long, and still doesn’t answer your UDF question! But it is a more dbtonic way of thinking about it; hopefully that’s OK and the UDF was a means to an end rather than a mandatory implementation detail.

lmk what you think!


OK, I just scrolled back to the top and saw that you specifically want to avoid precomputing the rollups

:sweat_smile:

In that case I’ll ask around and see whether I can find anyone who knows why your UDF won’t build correctly.


Thanks for putting so much work into this!

As you noticed, though, I am hoping to avoid precomputing it. We just wrote a version that might correctly precompute (very hard to validate). Given all of the additional complexity that I removed from this toy example (this rollup is just one of many that need to be performed), we ended up with over 500 lines of code using the date spine method. That’s simply never going to be maintainable. Not to mention, if we go for daily granularity, our small initial data set (which we expect to grow exponentially) produces over 3 million rows. Really an ideal candidate for runtime in-database computation!

I’m hopeful that dbt will find a way to support UDFs more directly in the future.


I’m glad you got the spine working, but sorry that it turned into such a monster!

I heard back with another strategy (and if anyone else is reading this, don’t do it unless you reeeeeally need to because it’s sort of a bad idea!):

You could inspect the contents of the graph object to get the location of a model. I had to do this in a very early version of the dbt_metrics package and it’s super gross because it breaks the node out of the dependency graph. In this case that’s the appropriate thing to do though - it’s also why I had to do it, because we were also resolving refs at runtime in the same sort of way as your UDF.
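A rough sketch of what that can look like (a hypothetical helper; graph and adapter are real objects in the macro context, but treat the details as unverified):

{% macro get_relation_from_graph(model_name) %}
  {# Look the model up in the graph and build its relation directly,
     deliberately bypassing ref() and therefore the dependency graph. #}
  {% if execute %}
    {% for node in graph.nodes.values() %}
      {% if node.resource_type == 'model' and node.name == model_name %}
        {{ return(adapter.get_relation(
            database=node.database,
            schema=node.schema,
            identifier=node.alias or node.name
        )) }}
      {% endif %}
    {% endfor %}
  {% endif %}
{% endmacro %}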
