Personally Identifiable Information (PII) is a real business concern with Data Warehouses. GDPR has more people talking about it now, but even without the new legislation it is something that needs careful consideration.
Having a well structured Data Warehouse with exposed PII makes walking out the door with your entire customer email list as simple as “SELECT * FROM customers”… and that’s scary.
Problem is, there are legit times you will need PII access. If an analyst is doing a study that needs components of the PII (like email provider), then a hashed string won’t be very helpful. If you want to do ‘fuzzy matching’ when creating unified customer profiles, you need to start with raw PII.
Here’s how we tackle it at RevZilla:
-
Transform common PII values (first name, last name, email address) in an ephemeral model using a standardized macro. Simple transforms like trimming and down-casing allow us to compare apples to apples.
-
Materialize a hash salt under a _PRIVATE_schema, something like this:
– DB specific ways of random gen, I just did this as a semi-universal example.
– ideally you would do ascii lookups on the integers the rand returns.
SELECT RANDOM()::text AS salt LIMIT 1 -
We materialize our production tables in a PRIVATE schema, with the prefix PRIVATE on the table. I.e ANALYSIS.PRIVATE.PRIVATE_USERS. This schema is restricted and only our PII_GRANTED role can access it.
-
We create a view of the table in the appropriate business schema. The view should be sans prefix, I.e. ANALYSIS.MARKETING.USERS. This view includes a join to our SALT table. Then it is as a simple macro for each PII field that hashes the value with the salt:
{%- macro anaon(col, salt) -%}
MD5({{col}}, {{salt}})
{%- endmacro -%}
And now the users have a safe, hashed value that they can aggregate on (because the hashes will match if the underlying values match).