How to Create a Series of Python UDFs for Parsing User Agents in Redshift


#1

What follows is a step-by-step guide to creating a series of Python UDFs on Redshift for parsing user agents with Python’s user-agents module. While our ultimate goal will be efficiently parsing user agents in Redshift, the first few steps are generalizable to building custom UDFs with any custom Python module (though your results may vary based on the complexities of the module you’re trying to use!).

1. Package up your python dependencies

When adding custom Python modules to your Redshift cluster, you need to first consider how Python modules work, and how they might possibly work on your cluster. Most modules rely on a series of other modules, which are loaded at runtime by looking at your machine’s PATH. In most cases, pip handles this dependency resolution without us having to worry about it. But since we can’t just open up our cluster’s terminal and type pip install user-agents, we have to do something a bit more complicated.

We’ll follow AWS Lab’s PipLibraryInstaller guide, though we won’t use their provided script so as to encourage greater understanding of what’s going on under the hood. We’ll walk through each step of the process. While you’re encouraged to try and abstract these steps out into your own script, my personal experience has shown that different modules require different installation strategies. I also doubt you’ll ever do this more than a handful of times per cluster, so it’s not clear what you’re saving by automating this process (unless you’re in the business of managing clusters for other people, then by all means, automate!).

First, we need to build a wheel of user-agents. A wheel file is the standard built-package format used for python and is essentially just a zip file of the module’s code and metadata. To do this let’s run the following commands:

mkdir -p tmp/
pip wheel user-agents --no-cache-dir --wheel-dir tmp/user-agents/

We should now be able to inspect the specified wheel-dir and see the following:

ls tmp/user-agents/
>>> ua_parser-0.8.0-py2.py3-none-any.whl user_agents-1.1.0-py3-none-any.whl

NOTE: As you can see, pip wheel generated two files when packaging up user-agents, one for the module itself, and one for ua_parser, the module it depends on.

To add these modules to Redshift, we need to upload both files to S3 using the awscli, which you can install via pip install --user awscli.

aws s3 cp tmp/ua-parser/ua_parser-0.8.0-py2.py3-none-any.whl \ 
 s3://my-bucket/ua-parser.zip
aws s3 cp tmp/ua-parser/user_agents-1.1.0-py3-none-any.whl \ 
 s3://my-bucket/user-agents.zip

NOTE: The above commands might be slightly different depending on the version of user-agents you install.

2. Add your modules to Redshift

After uploading your module to S3, we can now add it to our Redshift cluster by executing the following SQL statements:

CREATE LIBRARY ua_parser
LANGUAGE plpythonu
from 's3://my-bucket/ua-parser.zip'
WITH CREDENTIALS AS 'aws_access_key_id={my_key};aws_secret_access_key={my_secret}'

CREATE LIBRARY user_agents
LANGUAGE plpythonu
from 's3://my-bucket/user_agents.zip'
WITH CREDENTIALS AS 'aws_access_key_id={my_key};aws_secret_access_key={my_secret}'

NOTE: Replace {my_key} and {my_secret} with your AWS credentials. You can also use an iam_role here.

3. Write your UDFs

Next, we’ll attempt to test our module by writing our UDFs. We’ll follow Amazon’s best practices for naming UDFs by placing them in a custom schema and prefixing the function name with f_:

CREATE SCHEMA udf;

CREATE OR REPLACE FUNCTION udf.f_ua_is_bot(ua VARCHAR(MAX)) RETURNS boolean IMMUTABLE as $$
  if ua is None or ua == '': return False
  from user_agents import parse; return parse(ua).is_bot
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION udf.f_ua_browser(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
  if ua is None or ua == '': return None
  from user_agents import parse; return parse(ua).browser.family
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION udf.f_ua_browser_version(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
  if ua is None or ua == '': return None
  from user_agents import parse; return parse(ua).browser.version_string
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION udf.f_ua_device_brand(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
  if ua is None or ua == '': return None
  from user_agents import parse; return parse(ua).device.brand
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION udf.f_ua_device_family(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
  if ua is None or ua == '': return None
  from user_agents import parse; return parse(ua).device.family
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION udf.f_ua_device_model(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
  if ua is None or ua == '': return None
  from user_agents import parse; return parse(ua).device.model
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION udf.f_ua_os_family(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
  if ua is None or ua == '': return None
  from user_agents import parse; return parse(ua).os.family
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION udf.f_ua_os_version(ua VARCHAR(MAX)) RETURNS VARCHAR(MAX) IMMUTABLE as $$
  if ua is None or ua == '': return None
  from user_agents import parse; return parse(ua).os.version_string
$$ LANGUAGE plpythonu;

3. Testing your UDFs

You should now be able to use these UDFs in queries to parse user agent strings:

SELECT udf.f_ua_is_bot('Twitterbot');
>>> True
SELECT udf.f_ua_is_bot('Mozilla/1234');
>>> False

4. Optimizing your use of UDFs

One thing to know about UDFs is that they’re extremely slow. Think about what’s happening under the hood: Your database is basically piping a string out to another process and capturing the result. This will not be anywhere near as fast as a normal query. As a result you need to thoroughly optimize your user agent parsing queries to minimize the number of rows it operates on. Here are a few strategies:

Only parse distinct user agents.

Instead of naively running your UDF on every user agent string in your data warehouse, build a lookup table of unique user agents to user agent metadata, then join against this table when you want to enrich your user agent strings with additional metadata. You can also hash the user-agent string in order to speed up future joins on this table:

strtol(left(func_sha1(user_agent), 15), 16)::bigint

Don’t re-parse user agent strings

Instead of re-building this table everyday from scratch, identify the new user-agent strings that show up every day and add these to your lookup table.

5. Managing UDF permissions

Follow this guide to allow other DB users to access your UDFs.

6. Integrating with DBT

Follow @claire’s excellent guide for integrating UDFs into your DBT workflow: Using dbt to manage user defined functions (Redshift)