Note: This post uses Redshift SQL, however the same concept applies to most data warehouses. Check out the notes on BigQuery in the comments below.
Our dbt project uses some user defined functions (UDFs).
In the past, I’ve created these UDFs in my SQL console as the superuser, however this isn’t ideal as:
- The code was not surfaced as part of our dbt project, so it’s unclear what the UDF does
- There was no version control on the UDFs if we needed to update them
- And we weren’t maintaining separate development/production versions of the UDFs .
So I decided to put them into our dbt project, using this process:
- Created a subdirectory, just to keep things tidy:
macros/udfs/
- Created a file for each udf, e.g.
macros/udfs/f_future_date.sql
(below). Note that each of my UDFs is prefixed withf_
, as recommended in the Redshift documentation.
{% macro create_f_future_date() %}
CREATE OR REPLACE FUNCTION {{target.schema}}.f_future_date()
RETURNS TIMESTAMP
IMMUTABLE AS $$
SELECT '2100-01-01'::TIMESTAMP;
$$ LANGUAGE sql
{% endmacro %}
- Created a macro called
macros/create_udfs.sql
which calls each UDF macro. Note separation with;s
. Also, note the inclusion of acreate schema
statement – since the target schema is created afteron-run-start
hooks, and I’ve used the target schema in my UDFs, I have to create that schema if it doesn’t exist.
{% macro create_udfs() %}
create schema if not exists {{target.schema}};
{{create_f_list_custom_keys()}};
{{create_f_count_custom_keys()}};
{{create_f_future_date()}}
{% endmacro %}
- Added an
on-run-start
hook to my project
on-run-start:
- '{{create_udfs()}}'
- Updated the references to UDFs in my models to use the schema-versions, e.g.
SELECT
{{target.schema}}.F_FUTURE_DATE()
Other quirks I ran into when doing this were:
- As the superuser, I had to grant permission to users to create python udfs:
-- as your superuser
GRANT USAGE ON LANGUAGE PLPYTHONU TO userA;
- I had to use schemas because users cannot edit each others’ UDFs (i.e. if
userA
creates the UDF, anduserB
runs theCREATE OR REPLACE FUNCTION
statement, a “must be owner of function” error will occur). Using schemas means you are creating distinct UDFs* so won’t hit this issue. I would recommend using schemas anyway to maintain separate dev/production UDFs
* Assuming each Redshift user profile used by dbt has a distinct target schema.