Question how everyone handles their schema management for staging data versus transformations done on staging data.
It’s a good rule of thumb to not let users access the staging data, because as you build a warehouse you often need to constantly change how the staging and transformation process works, and if users write analytics/reports directly against the staging data the analytics will end up breaking as you change the staging and loading processes. And if users can somehow find and access data they will write reports against it one way or another.
To enable this, in a “traditional” DW, I am accustomed to having a separate schema or even database for the staging data versus the end-user accessible transformed data area and using the schema as a global way to ensure that the groups/roles to which end-users are assigning are blocked from the staging area.
We are trying to figure out how to handle this with dbt. With our current warehouse project in Snowflake, we have a “staging” schema and an “analytics” schema. We have multiple analysts editing and working on dbt code and data transformations. We also want to keep all of the dbt transformations in one dbt project to have one DAG that makes understanding the data workflow simpler for all concerned rather than splitting “staging” processing into one dbt project and “analytics” processing into another.
The problem that we are encountering is that in order to allow analysts to be working on different code branches and data transformation flows independently, we want to use dbt’s capabilities to allow them to just rename the schema and dbt should automatically create their own independent schemas. However, this doesn’t work so well with the staging schema because we only want to maintain one copy of that, and some of the dbt tasks involve creating additional worktables and views in the staging schema. We also have some parts of the process that go staging → analytics → staging → analytics because of the complexity of some transformations.
How do others handle this? Do you just put everything in one schema and try to be careful at a granular level about granting end-users access to only particular tables/views? Does everybody get their own copy of two schemas? Or are we just over-complicating all of this?
Can you elaborate on the difference between “staging” and “analytics” schemas for your team? In particular, you mentioned:
However, this doesn’t work so well with the staging schema because we only want to maintain one copy of that
In general, the dbt workflow supports each analyst having their own copies of all of the models in the project. If each analysts shares a single staging schema, I can imagine things getting pretty hairy pretty quickly!
If you haven’t already, check out the docs on custom schemas. By default, these will add a suffix to your “base” schema name. That will let you create schemas in development like:
- dbt_josh
- dbt_josh_staging
dbt will handle the materialization logic for your models, so you can definitely have models that weave between staging and “analytics” tables.
With custom schemas, you can override the schema name logic on an environment-specific basis. This functionality lets you eg. create all of your models in dbt_josh in dev, but split out the schemas into analytics and analytics_staging in prod. Check out the docs on doing that here and let me know if you have any questions!
Sounds like you are suggesting each analyst having their own separate staging schema as well. In our scenario this means that dbt would need to be creating and building two schemas per analyst, not just one, as we need dbt to be able to create and modify many of the objects in the staging schema as well as the analytics schema.
We tried this a bit early on but seemed to get a lot of errors and had difficulty with doing it. If we collapse everything into one schema, then we run into the issue of greater risk of end-users seeing data that they should not.
How do you suggest handling scenarios where the analysts are doing a bunch of staging transformations, involving both source data and intermediate steps (whether views or actual, materialized tables), but where the end results of the transformations should be a queryable, cleaned up model of data for usage by another separate set of end-users who should not be able to see the intermediate steps?
So, this isn’t a super unusual or uncommon use case – this is exactly why we build custom schemas! I’d do exactly the thing you’re describing: materialize your production models into the standard prod schema (eg. analytics), and materialize all of your intermediate transformations into some scratch schema (like analytics_staging or similar).
You can permission the schemas differently s.t. only analysts have access to the staging schema, but all relevant users have access to the production schema.
So, I’d totally recommend just using custom schemas here! You mentioned:
We tried this a bit early on but seemed to get a lot of errors and had difficulty with doing it.
Can you elaborate on what kinds of errors you saw or where the difficulty came in? Happy to help you get it sorted if you want to give it another try