Managing sensitive data access in a shared dbt repository

Hi everyone,

I’m looking for some insights on how to manage sensitive data access in a shared data warehouse project.

The Problem

We have a Kimball-style data warehouse repository managed with dbt. We are planning to add People Analytics models to this repository, which means bringing in HR-related data that may contain sensitive information, such as employee salaries, compensation-related data, wealth-related data, and other confidential business information.

Ideally, the entire data team should be able to contribute to the repository, maintain models, review code, and work on the data warehouse logic. However, only a small subset of authorized users should be allowed to view certain sensitive columns in the tables produced by some dbt models.

The main challenge is finding the right balance between:

  • Collaborative development in a shared dbt repository;
  • Maintainability of the dbt project;
  • Proper separation between code access and data access.

There are also a few specific constraints that make this more challenging:

1. Databricks admin access

We currently have a few people with the admin role in Databricks in this workspace, but only one of them should be allowed to access this sensitive data. Therefore, whatever solution we implement needs to take into account that admin-level permissions may bypass some access restrictions if not carefully designed.

2. dbt runs through a service principal

Our dbt jobs run through a service principal in the repository. Because of that, we need to ensure that even though dbt is able to build the models, only the authorized users can actually query or view the sensitive data produced by those models.

3. Concerns about relying only on tags

One possible approach would be to use tags or metadata to identify sensitive models or columns. However, this seems potentially fragile because anyone with access to edit the dbt code could theoretically remove or modify the tags. Even if tags are applied through a post-hook, someone could still change the hook logic or remove it from the model. So tags alone are not enough for enforcing sensitive data governance with these caveats.

Questions

I would appreciate any advice or examples from teams that have dealt with this kind of setup.

1. How do teams usually separate permissions for writing or reviewing dbt code from permissions for actually querying sensitive data?

2. How should teams handle cases where some users have Databricks admin permissions but should not be allowed to view specific sensitive datasets?

3. What is the recommended pattern when dbt runs with a service principal that has permission to build sensitive models, but only a few users should be able to query the resulting data?

4. Are tags, post-hooks, or metadata-based approaches reliable enough for this kind of governance, or should enforcement always happen outside dbt at the warehouse/catalog/security layer?

5. Are there any common patterns, best practices, or governance frameworks for this type of setup in dbt-based data warehouses?


Any practical examples, architectural patterns, or lessons learned would be very helpful.

Thanks in advance! :smiley: