Pre-proposal: column tags

I would like to be able to tag specific columns and have access to the metadata about them, perhaps only for dbt docs at first but I can see this being useful for automated features as well, much like sources and models tags are already useful.

My main use case is to tag PII data. It would be nice to be able to e.g. mark certain columns to be automatically hashed, or generate a warning, or something along those lines. Or even just surface them in documentation. I could see a PII tag acting something like Perl’s taint mode, and any column derived from PII data is tainted with PII unless specifically scrubbed (perhaps via macro call). Then I could get a list of columns that have been PII tagged, and warnings about untagged PII-tainted columns.

Other potential use cases: team ownership of columns in large orgs. Provide hints and metadata for other data governance requirements (e.g. finance, sox compliance).

I’m making this as a pre-proposal because my thoughts are very preliminary and I am vague on implementation requirements, and would love to solicit the community’s feedback.

2 Likes

Great idea.

I can see an additional use case. Run a certain test on all columns with the same tag. E.g. if there are a lot of lat/long datatypes in a data warehouse, and we have a custom test coded which checks that latitude is always between -90 and 90. You could then say “run this test on all columns which are tagged as ‘latitude’ and run this other test on all columns which are tagged as ‘longitude’”.

Also enables quickly creating and identifying data domains which can be very helpful in a larger data warehouse.

This would be nice for flagging data according to whatever security / sensitivity system you have in place. We do Green, Yellow, Orange, Red data and adding that to the schema.yml would allow for custom reporting.

  - name: sfdc_account
    description: base model for SFDC Accounts
    columns:
      - name: account_id
        tags:
          - green
        tests:
          - not_null
          - unique
      - name: account_name
        tags:
          - orange
          - pii
        tests:
          - not_null

Some validation on tags at a high level would be nice (i.e. specifying what are the available set of tags so that errors aren’t made).

1 Like

Cool proposal @mplovepop! I like your ideas here too @tmurphy and @josh :slight_smile:

See also this relevant issue about tagging tests: https://github.com/fishtown-analytics/dbt/issues/1586

It sounds like column-level tags will be useful for:

  1. selecting column-level tests on models
  2. describing / documenting columns

You can presently select models and tests with the --models tag:{tag name} syntax. Do you think dbt should distinguish between column-level tags and model-level tags? Or should they all just be “tags”?

I’m not sure how compelling it would be to run every model that contains a column with a certain tag, eg: dbt run --models column-tag:pii. Maybe the column-level tags should “bubble up” to the model that contains the tags? So, if a column contains a pii tag, then that whole model is considered to have the pii tag. I don’t think it has to work that way, but it’s kind of an interesting idea!

Are there any other use cases y’all are thinking about here?