Using this as a spot to jot down some article ideas that I think would be useful for community members, based on discussions I see come up in Slack. Feel free to add some more ideas in the comments.
If you feel like you have good things to say on one of these topics, I encourage you to write the article! To help get you started, check out our tips here, and feel free to DM me on Slack if you want to work on it together.
We see this question a lot, and the current solution (event-logging package) is not something we feel good about.
Running one dbt project N times (e.g. once for each customer)
Some companies have source data with the same structure for different customers, and rather than unioning the data together, they want to run the dbt project separately for each customer (e.g. because they are presenting this data back to their customers and don’t want other customers’ data to be surfaced).
There’s not a straightforward way to do this, since dbt assumes that one model == one output table, but I suspect some companies have come up with reasonably good solutions!
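To make the idea concrete, one rough way to approach it today is to parameterize the project with a dbt var and invoke it once per customer. This is just a sketch, not an established pattern — the model, schema, and var names below are all made up:

```sql
-- models/orders.sql (illustrative only)
-- Read from a per-customer raw schema and write to a per-customer output
-- schema, both chosen at runtime via a dbt var.
{{ config(schema='customer_' ~ var('customer')) }}

select *
from raw_{{ var('customer') }}.orders
```

You’d then run the project once per customer from your orchestrator, e.g. dbt run --vars '{customer: acme}'.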
Handling PII in your transformation layer
We need someone to write the bible on this! What are different strategies companies have used to handle this? Are there good opinions about obfuscating at the extract-load level, or doing it in the transformation layer?
Migrating an existing data pipeline to dbt
Assignee: some of my colleagues
How do you break up a migration project? Should you refactor as you go? Are there any good tips to making this process easier?
Integrating python processes with dbt
Assignee: me (but also happy to hand this to someone else if you have good ideas here!)
dbt doesn’t run python. But every so often, there’s a transformation that is better suited for python than SQL. How can you handle this in the most dbtonic way?
We’ve just started playing around with using BigQuery’s AEAD encryption functions to encrypt PII in the transformation layer. The general approach looks something like this:
Generate an encryption keyset for each user and write it to a table with columns user_id | encryption_keyset using keys.new_keyset('AEAD_AES_GCM_256'). This assumes the existence of a reliable user identifier, without which you won’t be able to encrypt and decrypt data for an individual user. Alternatively, if you’re working with just clickstream/event data, you could choose to generate a keyset per client. We still need to think about how we want to manage access to this table, since that is the deciding factor in whether or not you have access to unencrypted PII. In an ideal world, we’d only ever need to process PII as part of automated processes running with dedicated service accounts.
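As a sketch, assuming a users table with a reliable user_id (all table and column names here are illustrative):

```sql
-- Create the keys table once, with one keyset per existing user.
create table if not exists pii.user_encryption_keys as
select
  user_id,
  keys.new_keyset('AEAD_AES_GCM_256') as encryption_keyset
from analytics.users;

-- On later runs, only generate keysets for users we haven't seen before --
-- regenerating an existing keyset would make previously encrypted data unreadable.
insert into pii.user_encryption_keys (user_id, encryption_keyset)
select
  u.user_id,
  keys.new_keyset('AEAD_AES_GCM_256')
from analytics.users as u
left join pii.user_encryption_keys as k
  on u.user_id = k.user_id
where k.user_id is null;
```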
Encrypt any PII fields in source tables with aead.encrypt(user_keyset, field_to_encrypt, user_id). The user keyset is retrieved by joining to the table from step 1 on the user identifier. Ideally this would be your first transformation on the source data, after which you might choose to drop the source table or keep only the most recent few days of unencrypted data.
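That first transformation might look something like this (again, table and column names are illustrative):

```sql
-- Encrypt PII columns as the first transformation on the raw data.
-- aead.encrypt(keyset, plaintext, additional_data) returns BYTES; passing the
-- user_id as additional data means the same user_id is required to decrypt.
create or replace table staging.users_pii_encrypted as
select
  s.user_id,
  aead.encrypt(k.encryption_keyset, s.email, cast(s.user_id as string)) as email_encrypted,
  aead.encrypt(k.encryption_keyset, s.full_name, cast(s.user_id as string)) as full_name_encrypted,
  s.created_at
from raw.users as s
inner join pii.user_encryption_keys as k
  on k.user_id = s.user_id;
```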
When needed, use aead.decrypt_string(user_keyset, field_to_decrypt, user_id) to decrypt PII fields. Again, this requires both the user ID used to encrypt the field and access to the table with encryption keys.
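For example (only roles that can read both the encrypted table and the keys table can run this):

```sql
-- Decrypt an encrypted column back to a string.
select
  e.user_id,
  aead.decrypt_string(
    k.encryption_keyset,
    e.email_encrypted,
    cast(e.user_id as string)  -- must match the additional data used when encrypting
  ) as email
from staging.users_pii_encrypted as e
inner join pii.user_encryption_keys as k
  on k.user_id = e.user_id;
```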
To process data deletion requests, all you need is the user identifier for the person requesting deletion. Deleting the record for that user from the encryption keys table would take away the ability to decrypt any PII associated with that user, rendering them effectively anonymous. Note that this is trickier if you’re using client IDs; in that case, you might want to attempt identity stitching (assuming you collect other user identifiers) to find the client IDs associated with the user in question.
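In practice, the deletion itself can be a single statement (the query parameter is illustrative):

```sql
-- "Crypto-shredding": dropping the keyset makes all ciphertext encrypted
-- with it permanently undecryptable.
delete from pii.user_encryption_keys
where user_id = @user_id_to_forget;
```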
You could also enforce a data retention policy by generating one keyset per user per time interval instead of just one per user. For example, if you’d like to retain one year’s worth of data, you could generate one keyset per user per year and delete keysets older than a year.
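As a rough sketch, assuming the keys table from step 1 gains an extra key_year column:

```sql
-- One keyset per user per calendar year.
insert into pii.user_encryption_keys (user_id, key_year, encryption_keyset)
select
  u.user_id,
  extract(year from current_date()) as key_year,
  keys.new_keyset('AEAD_AES_GCM_256')
from analytics.users as u
left join pii.user_encryption_keys as k
  on u.user_id = k.user_id
  and k.key_year = extract(year from current_date())
where k.user_id is null;

-- Dropping keysets older than the retention window (here: the current and
-- previous calendar year) effectively deletes the data they encrypted.
delete from pii.user_encryption_keys
where key_year < extract(year from current_date()) - 1;
```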