How to implement dynamic data masking on Redshift

Ashley Van Name from JetBlue gave an excellent talk at Coalesce that shares how JetBlue was able to secure their data using dbt in combination with a Snowflake feature, dynamic data masking.

The nice thing about the Snowflake feature is that the policy is applied at run time — regardless of whether the underlying data is stored as a table or view.

For those of us not on Snowflake, this raises the question: what options do we have? How can we apply column-level policies to make sure our data is secure? Here is our recommended approach for Redshift users; we'd love to know if the same approach can be used by BigQuery users too!

Part 1: Understanding data masking logic

In essence, we want to run logic that masks data depending on who is running the query. We can achieve this with a case statement, the current_user() function, and the md5 function:

select
 case current_user()
   when 'admin' then birthday
   else md5(birthday)
 end as masked_birthday,
 ...

from ...

If we encode this logic into a view definition, when a user named admin selects from this view, they will see someone’s birthday, whereas anyone else will see an obfuscated value.

create view dbt_dave.users as (
  select
   case current_user()
     when 'admin' then birthday
     else md5(birthday)
   end as masked_birthday,
   ...

  from …
)

It’s important to note that if you were to create this model as a table instead, the value would get hard-coded depending on who ran the create table as statement, thereby removing the “dynamic” aspect of this approach.

Part 2: Implementing this logic in a dbt project

Step 1. Turn the logic into a macro

Since we will probably want to use this logic in multiple places, it’s worth turning it into a macro:

{% macro mask_column(column_name) %}

  case current_user()
    when 'admin' then {{ column_name }}
    else md5({{ column_name }})
  end

{% endmacro %}

Step 2, Take 1: Incorporate it into a model (that’s materialized as a view)

At first, we called the logic in models where we needed to mask data:

{{ config(materialized='view') }}

select
  {{ mask_column('first_name') }} as first_name,
  {{ mask_column('last_name') }} as last_name,
  favorite_bagel_flavor
from {{ ref('employees') }}

This worked, but there was no safeguard against someone building this model as a table. We considered working around this by raising an exception if a non-view materialization was used, but that only solved half the problem: if a downstream model was materialized as a table, it would undo our good work. Further, it would be hard to manage our permissions correctly if both sensitive and masked data lived in the views and tables within the same schema.
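For the record, the safeguard we considered would have been a small check at the top of the model or macro. A sketch, using dbt's standard `exceptions.raise_compiler_error` and `config.get` Jinja functions:

```sql
{# sketch: abort the run if someone tries to build this model as anything but a view #}
{% if config.get('materialized') != 'view' %}
    {{ exceptions.raise_compiler_error("Masked models must be materialized as views") }}
{% endif %}
```

Again, this doesn't protect against a downstream table materialization, which is why we moved away from this approach.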

It instead seemed cleaner to figure out a way to create a view with the masking logic on top of our model, in a separate schema, and only expose that schema to our end users.

Step 2, Take 2: Views on top of models

For each model that we need to apply data masking to, we want to create a view that masks any PII columns, in a separate schema. So, we created a macro that takes two arguments — the schema we want to materialize the view in, and the columns we deem to be PII. We also need to first create the schema if it doesn’t yet exist.


{% macro create_data_masked_view(schema, columns_to_mask) %}

    {% if execute %}

        {# get all columns in the relation #}
        {% set model_cols = adapter.get_columns_in_relation(this) %}

        {# create Relation object for masked view #}
        {%- set masked_view = api.Relation.create(
            database=this.database,
            schema=schema,
            identifier=this.identifier) -%}

        {# create the schema if it doesn't yet exist #}
        {% do adapter.create_schema(masked_view) %}

        {# create masked view in new schema for sensitive columns #}
        {% set view_sql %}

            create view {{ masked_view }} as (

                select
                    {% for col in model_cols %}
                        {% if col.name in columns_to_mask %}
                        {{ mask_column(col.name) }} as {{ col.name }}
                        {% else %}
                        {{ col.name }}
                        {% endif %}
                        {{ "," if not loop.last }}
                    {% endfor %}
                from {{ this }}
            )

        {% endset %}

        {% do run_query(view_sql) %}

        {% do dbt_utils.log_info("Masked view created at: " ~ masked_view) %}

    {% endif %}

    {# the post-hook still needs a statement to execute, so return a no-op #}
    select 1=1

{% endmacro %}

Since we need to run this after a model is created, post-hooks come in really handy here:

{{
   config(
       post_hook="{{ create_data_masked_view(
           schema='public_analytics',
           columns_to_mask=['first_name', 'last_name']
       ) }}"
   )
}}
 
select
...

We put together an example project here, which you can clone and try out yourself!

Step 3: Granting Redshift permissions

The last step here is to grant permissions on the views in each schema, so that only the right users are able to view the protected data. This would look something like:

-- analysts get access to the schema with our masked view
grant all on schema public_analytics to group analysts;
grant select on all views in schema public_analytics to group analysts;

-- admins get access to the private schema with unmasked PHI
grant all on schema private_analytics to group admins;
grant select on all views in schema private_analytics to group admins;

Additional considerations

  • This same approach could be implemented as a UDF within Redshift rather than a macro. The same case statement logic could be defined as a custom function, and referenced in the dbt code directly. More on UDFs here.
  • This should be extensible to BigQuery. Rather than the current_user() function, BigQuery has a session_user() function that can be referenced. We haven’t done any testing on this, but theoretically it should work in the same way!
  • We use a simple md5 function to mask the data, but there are other, more secure ways of hashing. As discussed in this related article, as well as Ashley’s talk, you can incorporate a “salt” value, appended to the data values prior to hashing, to make the hashes harder to reverse and improve the security of the PHI data. Alternatively, you could use a sha function (such as Redshift’s sha2) rather than md5.
  • We chose to maintain the list of users who get access to the PHI directly in the macro. For simplicity of updating, this could be maintained as a project variable referenced in the macro, and the macro itself wouldn’t need ongoing maintenance.
  • Additionally, rather than hardcoding the list of users with access to the protected schema, you could follow specific naming conventions for your Redshift users, and reference those in the macro. For example, all users in the admins user group could have the prefix admin_ on their username, and then the logic could look like:
select
 case 
   when current_user() ilike 'admin_%' 
      then birthday
   else md5(birthday)
 end as masked_birthday,
 ...

from ...
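As a concrete illustration of the salting idea mentioned above (the salt string here is a placeholder; in practice you’d keep it out of source control, e.g. in an environment variable):

```sql
select
 case current_user()
   when 'admin' then birthday
   -- concatenate a secret salt before hashing; 'replace_me' is a placeholder
   else sha2(birthday || 'replace_me', 256)
 end as masked_birthday,
 ...

from ...
```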

Alternative approaches

  • This dynamic masking could also be done at the column level in your BI tool. If you’re using something like Looker, you can use their user attributes functionality to dynamically allow users to see the PHI in the underlying data tables. This is a little more difficult to maintain, as you’d need to write the logic directly into the column definition of each field of interest, and ensure that only views are connected to your BI tool. We feel like this is more easily managed in the transform layer, but it can certainly be done here too!

This is awesome! Great writeup @dave.connors.

Had you considered using the meta property on columns to tag your columns as containing PII instead of hardcoding the list of columns in each place?


Hey @joellabes !

This is a fantastic question, and one we probably should have considered earlier on! :sweat_smile:

The meta property can absolutely be used to tag columns as PHI/PII, like the example in the docs site. This column property is available in the column object, and can be referenced in the macro jinja.

@claire took a swing at an updated version of this macro on this branch. This version does not accept columns_to_mask as an argument in the macro, and instead relies on a PII meta tag in the schema.yml configuration. Otherwise, it works exactly the same, and doesn’t require any updating of the post hook! Super elegant!

The choice between the original version and this meta version will depend on where you want the complexity of tagging PII columns to live. You can either have slightly more complex post_hook logic to directly specify the columns to mask, or slightly more complicated schema.yml files.
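For anyone curious, tagging a column this way in schema.yml looks something like the following (the meta key name is arbitrary; check Claire’s branch for the exact key her version of the macro expects):

```yml
models:
  - name: employees
    columns:
      - name: first_name
        meta:
          contains_pii: true
      - name: last_name
        meta:
          contains_pii: true
      - name: favorite_bagel_flavor
```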

Thanks again for the suggestion Joel!


Another approach: assuming each user querying the database uses their own database user (including on Looker, leveraging parametrized connections) rather than a service account, you could use the ABAC (attribute-based access control) pattern.

It requires maintaining a table containing users and attributes. You then join that attribute table to the table on which you want to apply security within a view, and only expose the view to the user.
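A rough sketch of that pattern (all table and column names here are hypothetical):

```sql
-- expose only this view to end users; the underlying tables stay private
create view analytics.orders_secure as (
    select o.*
    from private.orders o
    join private.user_attributes ua
      on ua.username = current_user()
     -- row-level security: each user sees only rows matching their attributes
     and ua.region = o.region
);
```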

Here’s a SQLFiddle demonstrating row-level security AND dynamic data masking based on the user querying the table.
