TLDR: programmatically manage dataset authorization and authorized view creation with a script that parses the dbt yamls and the manifest.
the Why: Here at Butterfly Network, we use Google BigQuery as our Datawarehouse. We structure out warehouse following the approach of raw, prc, and mart datasets managing all of our transformations with DBT. For the most part we grant end-users access to our mart datasets. Some of these are views on top of raw data that could include pii or other information that the end-user is not privy to. For BigQuery the obvious solution was authorized views.
We decided that a PR with DBT code was valid to say that the code is valid and the views that are created are authorized. We created a script that would parse the manifest and authorize the views that were created by DBT, such that the developer had to do no further work or remember to explicitly grant anything on a post hook.
Since we were dabbling in a script to authorize datasets, we also decided that we wanted to include the ability to specify who had access to which dataset in the schema.yml files. This enables our analytics team to have a one-stop shop to do all of their work and we can track when access was granted and why.
Here’s how we have our analyst specify access:
models: -name: <XXX> ... datasets: - name: mart_<XXX> access: - <group1>@butterflynetinc.com - <group2>@butterflynetinc.com - <group3>@butterflynetinc.com
and here’s the script we use to permission https://gist.github.com/emilyevans-bn/9055295f5602806c6fab013ac86711b7
A few call outs: we only permission groups to access the datasets in this way and we do not handle ephemeral models in our parsing because we don’t really use them
Hope this is helpful to someone!