Using dbt to manage user defined functions (Redshift)


#1

Our dbt project uses some user defined functions (UDFs).
In the past, I created these UDFs in my SQL console as the superuser; however, this isn’t ideal because:

  • The code was not surfaced as part of our dbt project, so it’s unclear what the UDF does
  • There was no version control on the UDFs if we needed to update them
  • And we weren’t maintaining separate development/production versions of the UDFs.

So I decided to put them into our dbt project, using this process:

  1. Created a subdirectory, just to keep things tidy: macros/udfs/
  2. Created a file for each UDF, e.g. macros/udfs/f_future_date.sql (below). Note that each of my UDFs is prefixed with f_, as recommended in the Redshift documentation.
{% macro f_future_date() %}
CREATE OR REPLACE FUNCTION {{target.schema}}.f_future_date()
RETURNS TIMESTAMP
IMMUTABLE AS $$
SELECT '2100-01-01'::TIMESTAMP;
$$ LANGUAGE sql
{% endmacro %}
  3. Created a macro called macros/create_udfs.sql which calls each UDF macro. Note that the macro calls are separated by semicolons. Also note the inclusion of a create schema statement: since the target schema is normally created after on-run-start hooks run, and I’ve used the target schema in my UDFs, I have to create that schema here if it doesn’t exist.
{% macro create_udfs() %}

create schema if not exists {{target.schema}};

{{f_list_custom_keys()}};

{{f_count_custom_keys()}};

{{f_future_date()}}

{% endmacro %}
  4. Added an on-run-start hook to my project
on-run-start:
    - '{{create_udfs()}}'
  5. Updated the references to UDFs in my models to use the schema-qualified versions, e.g.
SELECT
{{target.schema}}.F_FUTURE_DATE()

Other quirks I ran into when doing this were:

  1. As the superuser, I had to grant users permission to create Python UDFs:
-- as your superuser
GRANT USAGE ON LANGUAGE PLPYTHONU TO userA;
  2. I had to use schemas because users cannot edit each other’s UDFs (i.e. if userA creates the UDF, and userB then runs the CREATE OR REPLACE FUNCTION statement, a “must be owner of function” error will occur). Using schemas means you are creating distinct UDFs*, so you won’t hit this issue. I would recommend using schemas anyway to maintain separate dev/production UDFs.

* Assuming each Redshift user profile used by dbt has a distinct target schema.


#2

I like it! I think I’ll break up my GA post-processing code (which includes a BigQuery JavaScript UDF) and move the UDFs out like this to keep things appropriately modular.


#3

@claire So, I think this might be a stupid question, but why use UDFs instead of macros within dbt?


#4

Not a stupid question at all!
I use UDFs when I need to use Python to solve a problem that SQL (or Jinja) can’t.
For example, one of our data sources has values stored in JSON keys (🤦), like so:

json
{"known_key": true, "custom text": true, "more custom text": true}
{"known_key": true, "even more custom text": true}

I instead wanted my data like so:

custom_keys
["custom text", "more custom text"]
["even more custom text"]

I can’t think of how Redshift SQL or Jinja would be able to do the above.
But it’s an easy problem to solve using Python’s json package, so I wrapped it up in a UDF.
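
For reference, here’s a minimal sketch of what such a UDF could look like as a dbt macro, following the pattern from post #1. The name f_list_custom_keys matches the macro called in create_udfs() above, but the body here is illustrative rather than the actual implementation:

{% macro f_list_custom_keys() %}
CREATE OR REPLACE FUNCTION {{target.schema}}.f_list_custom_keys(j VARCHAR(MAX))
RETURNS VARCHAR(MAX)
IMMUTABLE AS $$
    # Illustrative only: return a JSON array of every key except the known ones
    import json
    known_keys = {'known_key'}
    custom = [k for k in json.loads(j).keys() if k not in known_keys]
    return json.dumps(custom)
$$ LANGUAGE plpythonu
{% endmacro %}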


#5

@claire Thanks so much for this post! Great tutorial. Depending on your Redshift config, you may also have to grant permissions to execute the functions you create. I do this with a macro in the post-hook:

{% for u in user_list %}
  {% for s in schema_list %}
    grant execute on all functions in schema {{ s }} to {{ u }} ;
  {% endfor %}
{% endfor %}
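
If it helps, here’s a rough sketch of how that fragment could be packaged up. The macro name grant_udf_access and the vars udf_users / udf_schemas are placeholders for illustration; swap in whatever lists you actually use:

{% macro grant_udf_access() %}
  {# Placeholder vars: define udf_users / udf_schemas in dbt_project.yml #}
  {% set user_list = var('udf_users', []) %}
  {% set schema_list = var('udf_schemas', [target.schema]) %}
  {% for u in user_list %}
    {% for s in schema_list %}
      grant execute on all functions in schema {{ s }} to {{ u }} ;
    {% endfor %}
  {% endfor %}
{% endmacro %}

The macro can then be called from a post-hook (as above) or an on-run-end hook.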

#6

@claire Are you recreating your production schema every time? If not, I can see 2 issues with this approach.

  1. If you change the number of arguments your UDF receives, you will end up with two functions with the same name but different argument lists (Redshift supports function overloading), as illustrated below. I’m not sure this is always the desired result.
  2. If you change the return type, I believe you will get an error when running the script.
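
For example (the schema name here is just illustrative): if the first run created a zero-argument function and the macro is later changed to take an argument, CREATE OR REPLACE leaves both overloads in place:

-- first deployment
CREATE OR REPLACE FUNCTION analytics.f_future_date()
RETURNS TIMESTAMP
IMMUTABLE AS $$
SELECT '2100-01-01'::TIMESTAMP;
$$ LANGUAGE sql;

-- later deployment with a changed signature: both
-- analytics.f_future_date() and analytics.f_future_date(int) now exist
CREATE OR REPLACE FUNCTION analytics.f_future_date(INT)
RETURNS TIMESTAMP
IMMUTABLE AS $$
SELECT DATEADD(day, $1, '2100-01-01'::TIMESTAMP);
$$ LANGUAGE sql;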

Regards,
Zeev


#7

Hey @zeevs
Yup, you’re totally right! A significant change in the function is going to break the macro.

I think I played around with including a DROP FUNCTION statement in my macro, but if the arguments change, the statement returns an error (since you’d be trying to DROP a function with different arguments from the one that currently exists).

So I decided that, in that case, I’d handle it manually. Not perfect, but thankfully not a common scenario 🙂

I think best practice here is to drop your production schema regularly, but that’s not something that I’ve implemented.


#8

I guess you could get the definitions of all existing UDFs from Redshift’s internal tables, drop them all, and then run clean CREATE statements for what you need.
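
Something like this might work as a starting point, though I haven’t tested it; it assumes Redshift exposes pg_proc, pg_namespace and the oidvectortypes() helper the way PostgreSQL does, and 'my_target_schema' is a placeholder:

-- generate DROP FUNCTION statements for every UDF in the target schema,
-- to be executed before create_udfs() runs
SELECT 'DROP FUNCTION ' || n.nspname || '.' || p.proname
       || '(' || oidvectortypes(p.proargtypes) || ');'
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'my_target_schema';  -- placeholder schema name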