Ashley Van Name from JetBlue gave an excellent talk at Coalesce talk that shares how JetBlue was able to secure their data using dbt in combination with a Snowflake feature, dynamic data masking.
The nice thing about the Snowflake feature is that the policy is applied at run time — regardless of whether the underlying data is stored as a table or view.
For those of us not on Snowflake, it begs the question — what options do we have? How can we apply column-level policies to make sure our data is secure? Here is our recommended approach for Redshift users, would love to know if the same approach can be used by BigQuery users too!
Part 1: Understanding data masking logic
In essence, we want to run logic that masks data depending on who is running the query. We can achieve this with a case statement, the current_user() function, and the md5
function:
select
case current_user()
when 'admin' then birthday
else md5(birthday)
end as masked_birthday,
...
from ...
If we encode this logic into a view definition, when a user named admin
selects from this view, they will see someone’s birthday, whereas anyone else will see an obfuscated value.
create view dbt_dave.users as (
select
case current_user()
when 'admin' then birthday
else md5(birthday)
end as masked_birthday,
...
from …
)
It’s important to note that if you were to create this model as a table instead, the value would get hard-coded depending on who ran the create table as
statement, thereby removing the “dynamic” aspect of this approach.
Part 2: Implementing this logic in a dbt project
Step 1. Turn the logic into a macro
Since we will probably want to use this logic in multiple places, it’s worth turning it into a macro:
{% macro mask_column(column_name) %}
case current_user()
when 'admin' then {{ column_name }}
else md5({{ column_name }})
end as {{ column_name }}
{% endmacro %}
Step 2, Take 1: Incorporate it into a model (that’s materialized as a view)
At first, we called the logic in models where we needed to mask data:
{{ config(materialized=’view’) }}
select
{{ mask_column(‘first_name’) }} as first_name,
{{ mask_column(‘last_name’) }} as last_name,
favorite_bagel_flavor
from {{ ref(‘employees’) }}
This worked, but there was no safeguard against someone building this model as a table. We considered working around this by raising an exception if a non-view materialization was called, but that only solved half the problem - if a downstream model was materialized as a table, it would undo our good work. Further, it would be hard to manage our permissions correctly if both sensitive and masked data lived in the views and tables within the same.
It instead seemed cleaner to figure out a way to create a view with the masking logic on top of our model, in a separate schema, and only expose that schema to our end users.
Step 2, Take 2: Views on top of models
For each model that we need to apply data masking to, we want to create a view that masks any PII columns, in a separate schema. So, we created a macro that takes two arguments — the schema we want to materialize the view in, and the columns we deem to be PII. We also need to first create the schema if it doesn’t yet exist.
{% macro create_data_masked_view(schema, columns_to_mask) %}
{% if execute %}
{# get all columns in the relation #}
{% set model_cols = adapter.get_columns_in_relation(this) %}
{# create Relation object for masked view #}
{%- set masked_view = api.Relation.create(
database=this.database,
schema=schema,
identifier=this.identifier) -%}
{# create schema #}
{% do adapter.create_schema(masked_view) %}
{# create masked view in new schema for sensitive columns #}
{% set view_sql %}
create view {{ masked_view }} as (
select
{% for col in model_cols %}
{% if col.name in columns_to_mask %}
{{ mask_column(col.name) }} as {{ col.name }}
{% else %}
{{ col.name }}
{% endif %}
{{ "," if not loop.last }}
{% endfor %}
from {{ this }}
)
{% endset %}
{% do run_query(view_sql) %}
{% do dbt_utils.log_info("Masked view created at: " ~ masked_view ) %}
{% endif %}
select 1=1
{% endmacro %}
Since we need to run this after a model is created, post-hooks come in really handy here:
{{
config(
post_hook="{{ create_data_masked_view(
schema='public_analytics',
columns_to_mask=['first_name', 'last_name']
) }}"
)
}}
select
...
We put together an example project here, which you can clone and try out yourself!
Step 3: Granting Redshift permissions
The last step here is to grant permissions on the views in each schema to the users who should/should not be able to view the protected data. This would look something like:
-- analysts get access to the schema with our masked view
grant all on schema public_analytics to group analysts;
grant select on all views in schema public_analytics to group analysts;
-- admins get access to the private schema with unmasked PHI
grant all on schema private_analytics to group admins;
grant select on all views in schema private_analytics to group admins;
Additional considerations
- This same approach could be implemented as a UDF within Redshift rather than a macro. The same case statement logic could be defined as a custom function, and referenced in the dbt code directly. More on UDFs here.
- This should be extensible to Big Query. Rather than the current_user() variable, Big Query has a session_user() variable that can be referenced. We haven’t done any testing on this, but theoretically it should work in the same way!
- We use a simple md5 function to mask the data, but there are other, more secure ways of hashing. As discussed in this related article, as well as Ashley’s talk, you can incorporate a “salt” value to add deterministically-random numbers to the database values prior to hashing to improve the security of the PHI data. Alternatively, you could use sha rather than md5.
- We chose to maintain the list of users who get access to the PHI directly in the macro. For simplicity of updating, this could be maintained as a project variable referenced in the macro, and the macro itself wouldn’t need ongoing maintenance.
- Additionally, rather hardcoding the list of users with access to the protected schema, you could follow specific naming conventions for your Redshift users, and reference those in the macro. For example, all users in the
admins
user group could have the prefixadmin_
on their username, and then the logic could look like:
select
case
when current_user() ilike 'admin_%'
then birthday
else md5(birthday)
end as masked_birthday,
...
from ...
Alternative approaches:
- This dynamic masking could also be done at the column level in your BI tool. If you’re using something like Looker, you can use their user attributes functionality to dynamically allow users to see the PHI in the underlying data tables. This is a little more difficult to maintain, as you’d need to write the logic directly into the column definition of each field of interest, and ensure that only views are connected to your BI tool. We feel like this is more easily managed in the transform layer, but it can certainly be done here too!