Using dbt to manage user defined functions

Note: This post uses Redshift SQL; however, the same concept applies to most data warehouses. Check out the notes on BigQuery in the comments below.

Our dbt project uses some user defined functions (UDFs).
In the past, I’ve created these UDFs in my SQL console as the superuser. However, this isn’t ideal as:

  • The code was not surfaced as part of our dbt project, so it’s unclear what the UDF does
  • There was no version control on the UDFs if we needed to update them
  • We weren’t maintaining separate development/production versions of the UDFs.

So I decided to put them into our dbt project, using this process:

  1. Created a subdirectory, just to keep things tidy: macros/udfs/
  2. Created a file for each udf, e.g. macros/udfs/f_future_date.sql (below). Note that each of my UDFs is prefixed with f_, as recommended in the Redshift documentation.
{% macro create_f_future_date() %}
CREATE OR REPLACE FUNCTION {{target.schema}}.f_future_date()
RETURNS TIMESTAMP
IMMUTABLE AS $$
SELECT '2100-01-01'::TIMESTAMP;
$$ LANGUAGE sql
{% endmacro %}
  3. Created a macro called macros/create_udfs.sql which calls each UDF macro. Note that the macro calls are separated with semicolons. Also note the inclusion of a create schema statement – since the target schema is created after on-run-start hooks run, and I’ve used the target schema in my UDFs, I have to create that schema if it doesn’t already exist.
{% macro create_udfs() %}

create schema if not exists {{target.schema}};

{{create_f_list_custom_keys()}};

{{create_f_count_custom_keys()}};

{{create_f_future_date()}}

{% endmacro %}
  4. Added an on-run-start hook to my project
on-run-start:
    - '{{create_udfs()}}'
  5. Updated the references to UDFs in my models to use the schema-qualified versions, e.g.
SELECT
{{target.schema}}.F_FUTURE_DATE()
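For example, a model might end up looking something like this (the table and column names here are purely illustrative):

-- models/subscriptions.sql (illustrative)
SELECT
  subscription_id,
  COALESCE(cancelled_at, {{target.schema}}.f_future_date()) AS effective_end_date
FROM {{ ref('stg_subscriptions') }}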

Other quirks I ran into when doing this were:

  1. As the superuser, I had to grant permission to users to create Python UDFs:
-- as your superuser
GRANT USAGE ON LANGUAGE PLPYTHONU TO userA;
  2. I had to use schemas because users cannot edit each other’s UDFs (i.e. if userA creates the UDF, and userB runs the CREATE OR REPLACE FUNCTION statement, a “must be owner of function” error will occur). Using schemas means you are creating distinct UDFs*, so you won’t hit this issue. I would recommend using schemas anyway to maintain separate dev/production UDFs.

* Assuming each Redshift user profile used by dbt has a distinct target schema.

11 Likes

I like it! I think I’ll break up my GA post-processing code (which includes a BigQuery javascript UDF) to move the UDFs out like this to keep things appropriately modular.

1 Like

@claire So, I think this might be a stupid question, but why use UDFs instead of macros within dbt?

2 Likes

Not a stupid question at all!
I use UDFs when I need to use python to solve a problem that SQL (or Jinja) can’t.
For example, one of our data sources has values stored in JSON keys (🤦), like so:

json
{"known_key": true, "custom text": true, "more custom text": true}
{"known_key": true, "even more custom text": true}

I instead wanted my data like so:

custom_keys
["custom text", "more custom text"]
["even more custom text"]

I can’t think of how Redshift SQL or Jinja would be able to do the above.
But it’s an easy problem to solve using the json python package, so I wrapped it up in a UDF.
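For reference, a rough sketch of what the macro behind one of these might look like – this is untested, and the “known key” handling here is deliberately simplified:

{% macro create_f_list_custom_keys() %}
CREATE OR REPLACE FUNCTION {{target.schema}}.f_list_custom_keys(payload VARCHAR(MAX))
RETURNS VARCHAR(MAX)
IMMUTABLE AS $$
    import json
    if payload is None:
        return None
    # return every key except the ones we already know about, as a JSON array
    known_keys = {'known_key'}
    return json.dumps([k for k in json.loads(payload).keys() if k not in known_keys])
$$ LANGUAGE plpythonu
{% endmacro %}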

1 Like

@claire Thanks so much for this post! Great tutorial. Depending on your Redshift config, you may also have to grant permissions to execute the functions you create. I do this with a macro in the post-hook:

{% for u in user_list %}
  {% for s in schema_list %}
    grant execute on all functions in schema {{ s }} to {{ u }} ;
  {% endfor %}
{% endfor %}
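For completeness, here is roughly how that could be wrapped up as a macro – the udf_users and udf_schemas vars are just placeholders for however you supply your users and schemas:

{% macro grant_udf_execute() %}
  {% set user_list = var('udf_users', []) %}
  {% set schema_list = var('udf_schemas', [target.schema]) %}
  {% for u in user_list %}
    {% for s in schema_list %}
      grant execute on all functions in schema {{ s }} to {{ u }};
    {% endfor %}
  {% endfor %}
{% endmacro %}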
2 Likes

@claire Are you recreating your production schema every time? If not, I can see 2 issues with this approach.

  1. If you change the number of arguments your UDF receives, you will end up with two functions with the same name but different arguments (Redshift supports function overloading). Not sure this is always the wanted result.
  2. If you change the returned type, I believe you will get an error when running the script.

Regards,
Zeev

1 Like

Hey @zeevs
Yup, you’re totally right! A significant change in the function is going to break the macro.

I think I played around with including a DROP FUNCTION statement in my macro, but if the arguments change, then the statement would return an error (since you’d be trying to DROP a function with different arguments to the one that currently exists).

So I decided that in that case I’d handle it manually. Not perfect, but thankfully also not a common scenario 🙂

I think best practice here is to drop your production schema regularly, but that’s not something that I’ve implemented

1 Like

I guess you could get the definitions of all existing UDFs from Redshift’s internal tables, drop them all, and then run clean create statements for what you need.
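Something along these lines could generate the drop statements (an untested sketch – 'my_target_schema' is a placeholder for your UDF schema):

-- build DROP FUNCTION statements for every UDF in a given schema
select 'drop function ' || n.nspname || '.' || p.proname
       || '(' || oidvectortypes(p.proargtypes) || ');' as drop_statement
from pg_proc p
join pg_namespace n on n.oid = p.pronamespace
where n.nspname = 'my_target_schema';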

1 Like

@claire, I adapted your instructions for BQ tonight and it seems one of the BigQuery Persistent UDF limitations is that you cannot define more than one CREATE FUNCTION per call.

This is apparent even by using the BigQuery UI; BQ won’t let you execute two CREATE FUNCTION statements in the same query unless they are set as TEMPORARY.

As a result, I don’t think it’s possible to use a create_udfs()-like function for BigQuery since all UDF creation will get compiled and executed in a single call. For example, my error back from dbt run is:

Syntax error: Unexpected extra token CREATE at [7:1]

I haven’t tried it yet, but I suspect calling one macro per array element in our dbt_project.yml's on-run-start hook should solve this, since it will (from my understanding) execute each singular CREATE FUNCTION in its own BQ request.

I will try it out tomorrow!

1 Like

@claire et al,

We were able to make BigQuery persistent UDFs work by adapting your Redshift instructions above. It’s not as modular as we’d like, but that’s due to BigQuery limitations more than dbt ones.

For anyone visiting in the future, here is a contrived example that implements pi() and radian() functions, and tucks them under an f_math dataset.

Note that pi() simply returns a constant, but radian() uses pi(). This has consequences for how you must call pi() inside of radian() (it must be project-id qualified).

In our dbt_project.yml we have three on-run-start hooks. One creates the schema (dataset in BigQuery) if it doesn’t exist; the other two create the persistent UDFs.

on-run-start:

  # MATH FUNCTIONS (order must be preserved!!)
  - '{{ create_schema("f_math") }}'
  - '{{ make_func_math_pi() }}'
  - '{{ make_func_math_radian() }}'

The create_schema macro is a simple wrapper around adapter.create_schema that has target.project hard-coded as the first argument. We could have just as easily called adapter.create_schema directly from the on-run-start hook.

-- macros/create_schema.sql

{% macro create_schema (schema_name) %}

{% do adapter.create_schema(target.project, schema_name) %}

{% endmacro %}

Our pi() UDF looks like this:

-- macros/udfs/math/pi.sql

{% macro make_func_math_pi () %}

CREATE OR REPLACE FUNCTION f_math.pi() RETURNS FLOAT64 AS (3.141592653589793);

{% endmacro %}

and our radian() UDF:

-- macros/udfs/math/radian.sql

{% macro make_func_math_radian () %}

CREATE OR REPLACE FUNCTION f_math.radian() RETURNS FLOAT64 AS (`{{ target.project }}`.f_math.pi() / 180);

{% endmacro %}

Note the need to use `{{ target.project }}` inside of the radian() UDF definition. Without this, the UDF creation will fail on dbt run because BigQuery requires that functions be fully-qualified up to the project id.


While this example works, things get hairy when these libraries grow. As such, we’ve reached out to the UDF PM to ask if the limitation on multiple persistent CREATE FUNCTION statements per call could be lifted.

4 Likes

In step 2, if the UDF code does a CREATE LIBRARY in addition to the CREATE FUNCTION, this might cause an issue, as the CREATE LIBRARY statement in Redshift doesn’t seem to be valid inside a transaction. I got this error when I tried it: CREATE LIBRARY cannot run inside a transaction block

Hmm, I haven’t tried this, but it’s possible that if you run a commit; just before your create library statement, you’ll get it to work! Let me know how that goes!

Hey @dabit – I got some more info internally on this, and my previous reply is not a good one! Running a commit; isn’t a great idea because it might mess with other queries that dbt runs, where dbt is expecting a transaction to already be open.

Instead, you could run this statement as an operation (docs). So, before you dbt run, execute the following:

dbt run-operation create_udfs

You’ll have to adjust your create_udfs macro to actually run each query – the run_query macro will not begin a transaction automatically.

{% macro create_udfs() %}

{# adjust the create library statement to match your library's actual source #}
{% do run_query('create or replace library my_library language plpythonu') %}

{# Jinja inside a string literal isn't rendered, so concatenate the schema name instead #}
{% do run_query('create schema if not exists ' ~ target.schema) %}

{% do run_query(create_f_list_custom_keys()) %}

{% do run_query(create_f_count_custom_keys()) %}

{% do run_query(create_f_future_date()) %}

{% endmacro %}

^ I haven’t tested this code but it should be close to the thing that you want!

2 Likes

Thanks @claire, I just moved from a pre-hook to a run-operation and it’s working 👍
I kept my create_udfs macro as-is (similar to the one from your initial post, minus the create library statements), i.e. without wrapping each statement in run_query.

1 Like

Hi @claire - One question I have about this is the schema. If one has overridden the default (built-in) macros/get_custom_schema.sql with their own logic, how would they make a run-operation like the one above comport with the custom schema logic being used during regular model compilation?

Here’s one way to do it that worked for me:

  1. Define the functions in a macro, and inside the macro call set_sql_header like this:
{% macro define_bigquery_functions() %}
{% call set_sql_header(config) %}
create temporary function int(value any type)
returns int64 as (
  cast(cast(value as decimal) as int64)
);
{% endcall %}
{% endmacro %}
  2. Define a shorthand macro that wraps config like this:
{% macro table() %}
{{ define_bigquery_functions() }}
{{ config(materialized='table', **kwargs) }}
{% endmacro %}
  3. Now you can just invoke table() instead of config(materialized='table') at the top of your model files, and you’ll have access to your temporary functions!
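A minimal model file using this might look like the following (the model and column names are just illustrative):

-- models/orders_summary.sql (illustrative)
{{ table() }}

select
  order_id,
  int(order_total) as order_total_int
from {{ ref('stg_orders') }}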

Advantages of this method:

  • You don’t have to specify the schema name in front of every function call
  • You don’t have to wait for an extra on-run-start operation to run before every dbt run
  • Actually reduces boilerplate instead of adding more of it
  • You’ll never again wonder whether config() takes materialize= or materialized=

Disadvantages of this method:

  • You can’t call these functions inside view definitions (BigQuery doesn’t support temporary functions in views)
1 Like

Hey, something that’s taken me a while to figure out and probably worth adding to your post: we were getting errors in our build pipeline because we added a model and then referenced it in a UDF in the same pull request, which then triggered a build starting with dbt seed. When the on-run-end hook ran, it failed because it was referencing models that didn’t exist.

I managed to resolve this by using the WHICH flag (new in dbt 1.0) as follows:

{% macro create_udfs() %}
  {% if flags.WHICH == "run" %}
    {{ create_unbilled_volume_report() }}
  {% else %}
    select 1 as id
  {% endif %}
{% endmacro %}

Hi folks! - Working with my team to manage our UDFs, and curious if anyone has gained traction with getting the multiple persistent CREATE FUNCTION limitation lifted?

1 Like

While following these instructions on dbt 1.2 with Snowflake, I ran into the error “Snowflake adapter: got empty sql statement, moving on” and was able to resolve it by calling {% do run_query(create_udfs()) %} instead of just the macro.

Still, it seems a run operation may be preferred.
