Does dbt Cloud store any of my warehouse source data? I expect the answer is no, since dbt just runs queries against the source, but I want to check. What I am really asking is: is there any data governance risk in using dbt Cloud in dev and prod vs. running dbt inside, say, a customer’s AWS VPC?
Hey @johnoscott! Great question. I’m going to answer based on dbt Cloud’s current design and implementation, although it’s conceivable that some parts of this answer could change in the future as the product evolves. If anyone is reading this in the distant future, feel free to ping me to see if anything has changed and I’ll keep this thread updated.
At the moment, dbt Cloud stores the following data persistently:
1. Your dbt Cloud account information: things like job definitions, database connection information, users, etc.
2. Logs associated with jobs and interactive queries you’ve run.
3. Your dbt “assets”: things like `run_results.json` and `manifest.json`.
In #1, we can be sure that this does not include any raw data from your warehouse because we know exactly what type of information is stored here. For #2 and #3, that gets just slightly more complicated, because you control these assets.
Here’s one example: it’s totally possible to write dbt code that fetches all customer data from your customers table and then writes it out to the logs. This is almost definitely a bad idea, but it is possible to do. If one were to write this code, the logs would contain all customer data and therefore dbt Cloud would store it.
So, the more complicated answer is “no, dbt Cloud doesn’t store data from your warehouse unless you specifically write some particular piece of code that will cause that data to be written to the logs, or to a compiled dbt asset.”
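To make the logging scenario above concrete, here is a minimal sketch of the kind of macro that would cause warehouse data to end up in the logs. The macro name `log_customer_emails`, the `customers` table, and its `email` column are all hypothetical; `run_query`, `execute`, and `log` are standard dbt Jinja functions. This is shown as something to avoid, not as a recommended pattern.

```sql
{# ANTI-PATTERN: illustration only. Macro, table, and column names are hypothetical. #}
{% macro log_customer_emails() %}
  {# run_query executes SQL against the warehouse and returns the result set #}
  {% set results = run_query("select email from customers") %}

  {# Only touch results at execution time; nothing is returned while dbt parses the project #}
  {% if execute %}
    {% for value in results.columns[0].values() %}
      {# Each call writes a raw warehouse value into the dbt logs #}
      {% do log(value | string, info=true) %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

If a macro like this ran as part of a job, every value it logged would become part of that job’s logs and would therefore be persisted along with them.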
Finally, dbt Cloud does have data from your warehouse pass through its infrastructure when writing interactive queries in the IDE. If you write `select * from customers limit 100`, the data from your customers table will pass through the dbt Cloud infrastructure on the way to your browser. At the moment, there are no steps in that process that persist the data: dbt Cloud doesn’t perform any caching or other behavior whereby that data lives on our servers outside of your browser session.
As a result of all of the above, dbt Cloud’s data security / governance responsibilities are somewhat more straightforward than they would be if it were, for instance, a data warehouse or a data integration tool (both of which persist your data). Even so, dbt Cloud typically has to have a very high level of access to your data warehouse in order to do its job, and as such we take security extremely seriously. Take a look at our security page to learn more, and feel free to ping us at support@getdbt.com if you have specific questions about your account.
Hey @tristan ! Thanks for this detailed answer. Just checking to see if there are any updates to be aware of in June 2022.
Huge dbt fan by the way
Hey @lucash - still the same! I was going to quote the data storage section of our security page, and then I realised that it was basically a paraphrasing of Tristan’s original reply.
Since this was written, we have also completed a bunch of security-related examinations, including SOC 2 Type II, ISO 27001:2013 and ISO 27701:2019. For the long versions, check out the full compliance section.
Resetting the timer on this:
Looking forward to writing the next update to this from my flying car
@joellabes Awesome! Thank you for the response. I figured it wouldn’t be too different, if at all, just wanted to confirm