PII Anonymization and DBT

Personally Identifiable Information (PII) is a real business concern with Data Warehouses. GDPR has more people talking about it now, but even without the new legislation it is something that needs careful consideration.
Having a well structured Data Warehouse with exposed PII makes walking out the door with your entire customer email list as simple as “SELECT * FROM customers”… and that’s scary.

Problem is, there are legit times you will need PII access. If an analyst is doing a study that needs components of the PII (like email provider), then a hashed string won’t be very helpful. If you want to do ‘fuzzy matching’ when creating unified customer profiles, you need to start with raw PII.

Here’s how we tackle it at RevZilla:

  1. Transform common PII values (first name, last name, email address) in an ephemeral model using a standardized macro. Simple transforms like trimming and down-casing allow us to compare apples to apples.

  2. Materialize a hash salt under a _PRIVATE_schema, something like this:
    – DB specific ways of random gen, I just did this as a semi-universal example.
    – ideally you would do ascii lookups on the integers the rand returns.
    SELECT RANDOM()::text AS salt LIMIT 1

  3. We materialize our production tables in a PRIVATE schema, with the prefix PRIVATE on the table. I.e ANALYSIS.PRIVATE.PRIVATE_USERS. This schema is restricted and only our PII_GRANTED role can access it.

  4. We create a view of the table in the appropriate business schema. The view should be sans prefix, I.e. ANALYSIS.MARKETING.USERS. This view includes a join to our SALT table. Then it is as a simple macro for each PII field that hashes the value with the salt:
    {%- macro anaon(col, salt) -%}
    MD5({{col}}, {{salt}})
    {%- endmacro -%}

And now the users have a safe, hashed value that they can aggregate on (because the hashes will match if the underlying values match).

7 Likes

I love this. I think Snowflake makes a lot of this kind of stuff a bit easier than other warehouses–I find its permission model a bit easier to understand and we always end up implementing things in a much more locked-down way on Snowflake vs. Redshift. Adding this layer of PII anonymization that you’ve outlined is a really great improvement that we should be doing more often for our clients. Did you end up doing that right out of the gate or was it an enhancement you made to your warehouse along the way?

This was definitely an iterative process. Initially we did what I’m sure a lot of shops do, and went very heavy-handed. There was to be no PII anywhere, ever, under any circumstances. Everything was obfuscated the second it hit the warehouse, no exceptions. What we found, of course, is that there are always use case exceptions; when business users had legit requirements for reporting that included PII we would send them back to the transactional system, totally undermining the point of BI.
This 2 layer approach has worked really well so far. I’ve had to get creative as we move to marts - appending _private to mart schemas and creating the same 2 layers in a smaller context.

2 Likes