Permissions on Sensitive Data

pempey · October 1, 2020, 7:58pm

I am in a situation where some of the data coming into the warehouse is considered sensitive and not for consumption by all members of the organization (i.e. part costs, unit price, general ledger). One solution would be to put the necessary tables in schemas based on who needs access and grant permissions on the schema level. But that breaks up grouping tables by topic and feels like the ‘thin end of the wedge’ on a schema proliferation problem. An other option is manage the permissions on the table level but I have not found a clean way to automate this from within dbt. I imagen it will be something like getting a custom property in to a macro that I can use to deterring what users have access but I am not sure.

So before I start a redesign, is one of these methods (schema vs. Table) better than the other?

RobMcZag · October 3, 2020, 6:12pm

Hi @pempey,
I think you could save yourself some future complications and grant some future flexibility by giving access to roles (with users assigned to roles) through views.

You can have one base (“restricted”) view where the problematic fields are always null and are accessible by a role all users have. This will be the default view of those data.
Then where necessary you create another “unrestricted” view with all the fields, but accessible only by a role assigned only to the people with the right to see the full data.

In this way all users have the base views with all the fields, reports can be built with the common interface and normal users never get the problematic data, but a user with the advanced role could just swap the view and see all the data. You could even leverage some super-power from your BI tool to do that swap transparently depending on the user.

We used a similar approach to remove personal identifiable information (PIIs).

The quickest idea of how to do that in DBT is to have the two set of views created in two / N “parallel” folders with a post-hook per folder to assign the correct role.

The idea of N folders instead of just two is because you might be in a situation where you have N different roles to control access, with one being the generic user and the other granting access to a specific subset of reserved data like purchases, finance, hr and so on…

fabrice.etanchaud · October 5, 2020, 12:49pm

Hi @Peter Empey, Did you already read this paper from gitlab on the topic ? sensitive data

hoping it helps,
Best

pempey · October 5, 2020, 2:44pm

@fabrice.etanchaud Thanks for the link it was a good read and it did provide some insight on possible paths for implementation.

@RobMcZag In the current project I am working on we are using roles, attached to Active Directory, to determine access to the data. Admittedly, I am not keen on the idea of parallels models but the post-hook is similar to what I have been thinking.

Both of the approaches mentioned are a per-column approach. I was thinking on the schema or table, is column going to be better?

RobMcZag · October 6, 2020, 5:26am

Hi @pempey, if you can go with per schema or per table without sacrificing ability to work “normally” both in development and in use of the data platform, then I suppose it is going to be easier to maintain as you have less nodes to assign roles and they are well separated.

The approach that I suggested can help you achieve both the interface compatibility (if it makes sense and it is of any value for your case) and the schema level role assignment, if you put the full views in a “full” schema and the reserved views in a “reserved” schema that is the default for all users.

As for the duplication you can use the dbt-utils star macro that has an exclude field to build a macro that builds the restricted view by removing the restricted fields and returning null on them.

If you want the model name and list of excluded fields can be in a variable coming from a YAML… so you have a restricted view that can contain always the same call and all the config is in a YAML file.
And If you go for different schema the full and restricted view can have the same name.

You could even have the YAML at least partially generated by SQL looking at the existing tables/views in the full schema (and maybe excluding the similarly named views in the restricted schema).

Example of the star macro:
select {{ dbt_utils.star(from=ref('my_model'), except=["exclude_field_1", "exclude_field_2"]) }} from {{ref('my_model')}}

pempey · November 12, 2020, 4:47pm

So I ran into some technical hurdles with the way the client was deploying groups (Active Directory ->Azure? -> AWS roll -> Redshift group) and had to restructure a few things but here is the current path forwarded.

Permissions will be at the table level and there hierarchy of sorts to give access to increasingly sensitive information. This was combined with controlling what level of curated tables (reports, dimensional models, staged data, raw data) a user will have access to as well so that all permissions can be managed with one process.

It starts with a series of variables in the project file to define the list of groups a table will be given select access to.

vars:
    # Data warehouse permission Groups
    reports: ['reports','analysis','privileged','sensitive','development','administration']
    analysis: ['analysis','privileged','sensitive','development','administration']
    privileged: ['privileged','sensitive','development','administration']
    sensitive: ['sensitive','development','administration']   
    development: ['development','administration']
    administration: ['administration']
    all_groups: ['reports','analysis','privileged','sensitive','development','administration']

Then a default set of permissions is applied in the project file using a custom configuration using the name of variable. Note this is just the name of the variable and not the result of the var function. And a macro is applied as a post hook to generate the permission granting statements.

models:
    my-project: 
        post-hook: ["{{ apply_security() }}"]       
        staging:            
            +security_group: 
                - 'development'        

        marts:            
            +security_group: 
                - 'analysis'

        reporting:           
            +security_group: 
                - 'reports'

The macro to generate the permission statements compares the provided list with the full list and generates a grant or revoke statement accordingly.

{% macro apply_security() -%} 

{%- set security_groups = config.get('security_group') -%}

{%- set full_list = var('all_groups') -%}

{%- for security_group in security_groups -%}
    {%- set groups = var(security_group)-%}
    {%- for group in full_list -%}
        {% if group is in groups %}
             grant select on {{this}} to group {{ group }} ; 
        {% else %}
             revoke select on {{this}} from group {{ group }};
        {% endif %}
    {%- endfor %}
{%- endfor %}

{%- endmacro %}

When a table is identified as needed a level different then the default can be overridden with a model configuration.

 {{
  config(
    security_group = ['sensitive']
  ) 
}}

This set up allows for procedural control of access on a model level, at least for this client and there set up.

Topic		Replies	Views
[Obsolete] The exact grant statements we use in a dbt project Archive	4	39348	July 10, 2019
Automated access management to PII with dbt and Immuta Archive	0	6579	September 2, 2021
Permissioning & Relations with Redshift Archive best-practice , redshift , privileges	2	3828	October 4, 2018
Database privilege error when creating a view in a schema Help	3	1609	August 12, 2024
Supporting multiple analysts with different permissions Archive	3	3100	May 4, 2020

Permissions on Sensitive Data

Related topics