What follows is a step-by-step guide to creating a series of Python UDFs on Redshift for parsing user agents with Python’s user-agents
module. While our ultimate goal will be efficiently parsing user agents in Redshift, the first few steps are generalizable to building custom UDFs with any custom Python module (though your results may vary based on the complexities of the module you’re trying to use!).
1. Package up your python dependencies
When adding custom Python modules to your Redshift cluster, you need to first consider how Python modules work, and how they might possibly work on your cluster. Most modules rely on a series of other modules, which are loaded at runtime by looking at your machine’s PATH
. In most cases, pip
handles this dependency resolution without us having to worry about it. But since we can’t just open up our cluster’s terminal and type pip install user-agents
, we have to do something a bit more complicated.
We’ll follow AWS Lab’s PipLibraryInstaller guide, though we won’t use their provided script so as to encourage greater understanding of what’s going on under the hood. We’ll walk through each step of the process. While you’re encouraged to try and abstract these steps out into your own script, my personal experience has shown that different modules require different installation strategies. I also doubt you’ll ever do this more than a handful of times per cluster, so it’s not clear what you’re saving by automating this process (unless you’re in the business of managing clusters for other people, then by all means, automate!).
First, we need to build a wheel
of user-agents
. A wheel
file is the standard built-package format used for python and is essentially just a zip
file of the module’s code and metadata. To do this let’s run the following commands:
mkdir -p tmp/
pip wheel user-agents --no-cache-dir --wheel-dir tmp/user-agents/
We should now be able to inspect the specified wheel-dir
and see the following:
ls tmp/user-agents/
>>> ua_parser-0.8.0-py2.py3-none-any.whl user_agents-1.1.0-py3-none-any.whl
NOTE: As you can see, pip wheel
generated two files when packaging up user-agents
, one for the module itself, and one for ua_parser
, the module it depends on.
To add these modules to Redshift, we need to upload both files to S3 using the awscli
, which you can install via pip install --user awscli
.
aws s3 cp tmp/ua-parser/ua_parser-0.8.0-py2.py3-none-any.whl \
s3://my-bucket/ua-parser.zip
aws s3 cp tmp/ua-parser/user_agents-1.1.0-py3-none-any.whl \
s3://my-bucket/user-agents.zip
NOTE: The above commands might be slightly different depending on the version of user-agents
you install.
2. Add your modules to Redshift
After uploading your module to S3, we can now add it to our Redshift cluster by executing the following SQL statements:
CREATE LIBRARY ua_parser
LANGUAGE plpythonu
from 's3://my-bucket/ua-parser.zip'
WITH CREDENTIALS AS 'aws_access_key_id={my_key};aws_secret_access_key={my_secret}'
CREATE LIBRARY user_agents
LANGUAGE plpythonu
from 's3://my-bucket/user_agents.zip'
WITH CREDENTIALS AS 'aws_access_key_id={my_key};aws_secret_access_key={my_secret}'
NOTE: Replace {my_key}
and {my_secret}
with your AWS credentials. You can also use an iam_role
here.
3. Write your UDFs
Next, we’ll attempt to test our module by writing our UDFs. We’ll follow Amazon’s best practices for naming UDFs by placing them in a custom schema and prefixing the function name with f_
:
CREATE SCHEMA udf;
CREATE OR REPLACE FUNCTION udf.f_ua_is_bot(ua VARCHAR(MAX)) RETURNS boolean IMMUTABLE as $$
if ua is None or ua == '': return False
from user_agents import parse; return parse(ua).is_bot
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION udf.f_ua_browser(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
if ua is None or ua == '': return None
from user_agents import parse; return parse(ua).browser.family
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION udf.f_ua_browser_version(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
if ua is None or ua == '': return None
from user_agents import parse; return parse(ua).browser.version_string
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION udf.f_ua_device_brand(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
if ua is None or ua == '': return None
from user_agents import parse; return parse(ua).device.brand
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION udf.f_ua_device_family(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
if ua is None or ua == '': return None
from user_agents import parse; return parse(ua).device.family
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION udf.f_ua_device_model(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
if ua is None or ua == '': return None
from user_agents import parse; return parse(ua).device.model
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION udf.f_ua_os_family(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
if ua is None or ua == '': return None
from user_agents import parse; return parse(ua).os.family
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION udf.f_ua_os_version(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
if ua is None or ua == '': return None
from user_agents import parse; return parse(ua).os.version_string
$$ LANGUAGE plpythonu;
3. Testing your UDFs
You should now be able to use these UDFs in queries to parse user agent strings:
SELECT udf.f_ua_is_bot('Twitterbot');
>>> True
SELECT udf.f_ua_is_bot('Mozilla/1234');
>>> False
4. Optimizing your use of UDFs
One thing to know about UDFs is that they’re extremely slow. Think about what’s happening under the hood: Your database is basically piping a string out to another process and capturing the result. This will not be anywhere near as fast as a normal query. As a result you need to thoroughly optimize your user agent parsing queries to minimize the number of rows it operates on. Here are a few strategies:
Only parse distinct user agents.
Instead of naively running your UDF on every user agent string in your data warehouse, build a lookup table of unique user agents to user agent metadata, then join against this table when you want to enrich your user agent strings with additional metadata. You can also hash the user-agent string in order to speed up future joins on this table:
strtol(left(func_sha1(user_agent), 15), 16)::bigint
Don’t re-parse user agent strings
Instead of re-building this table everyday from scratch, identify the new user-agent strings that show up every day and add these to your lookup table.
5. Managing UDF permissions
Follow this guide to allow other DB users to access your UDFs.
6. Integrating with DBT
Follow @claire’s excellent guide for integrating UDFs into your DBT workflow: Using dbt to manage user defined functions