I would like to be able to tag specific columns and have access to the metadata about them, perhaps only for dbt docs at first but I can see this being useful for automated features as well, much like sources and models tags are already useful.
My main use case is to tag PII data. It would be nice to be able to e.g. mark certain columns to be automatically hashed, or generate a warning, or something along those lines. Or even just surface them in documentation. I could see a PII tag acting something like Perlâs taint mode, and any column derived from PII data is tainted with PII unless specifically scrubbed (perhaps via macro call). Then I could get a list of columns that have been PII tagged, and warnings about untagged PII-tainted columns.
Other potential use cases: team ownership of columns in large orgs. Provide hints and metadata for other data governance requirements (e.g. finance, sox compliance).
Iâm making this as a pre-proposal because my thoughts are very preliminary and I am vague on implementation requirements, and would love to solicit the communityâs feedback.
I can see an additional use case. Run a certain test on all columns with the same tag. E.g. if there are a lot of lat/long datatypes in a data warehouse, and we have a custom test coded which checks that latitude is always between -90 and 90. You could then say ârun this test on all columns which are tagged as âlatitudeâ and run this other test on all columns which are tagged as âlongitudeââ.
Also enables quickly creating and identifying data domains which can be very helpful in a larger data warehouse.
This would be nice for flagging data according to whatever security / sensitivity system you have in place. We do Green, Yellow, Orange, Red data and adding that to the schema.yml would allow for custom reporting.
- name: sfdc_account
description: base model for SFDC Accounts
columns:
- name: account_id
tags:
- green
tests:
- not_null
- unique
- name: account_name
tags:
- orange
- pii
tests:
- not_null
Some validation on tags at a high level would be nice (i.e. specifying what are the available set of tags so that errors arenât made).
It sounds like column-level tags will be useful for:
selecting column-level tests on models
describing / documenting columns
You can presently select models and tests with the --models tag:{tag name} syntax. Do you think dbt should distinguish between column-level tags and model-level tags? Or should they all just be âtagsâ?
Iâm not sure how compelling it would be to run every model that contains a column with a certain tag, eg: dbt run --models column-tag:pii. Maybe the column-level tags should âbubble upâ to the model that contains the tags? So, if a column contains a pii tag, then that whole model is considered to have the pii tag. I donât think it has to work that way, but itâs kind of an interesting idea!
Are there any other use cases yâall are thinking about here?
Hi @drew, I like your idea of being able to âbubble upâ tags from columns to models. But additionally, I would like to be able to âbuddle forward?â tags from one model to the other. Eg, if I reference a model containing the pii tag, then I would like to add the pii tag to this model as well, as it might actually contain pii. Does that make sense?