DBT schemas and storage costs

Hi. I’d love to understand some of the thinking in the DBT community about this.

With the recommended DBT workflow using a separate schema for each DBT developer (and potentially creating schemas dynamically to run checks on pull-requests), doesn’t this dramatically increase your data storage requirements?

I’m comparing this to, say, Looker PDTs, where only the PDTs that a developer has changed in a branch are rebuilt (and duplicated). Otherwise, Looker uses the production PDTs while in development.

My understanding is that in DBT-land people mostly rebuild the whole schema in separate development schemas, regardless of what’s changed?

How do you manage the performance and storage implications of this?

If you have a team of 5 DBT developers, are folks really keeping 5 duplicate data sets in their warehouse?

How is everyone dealing with that?

Is the solution just using Snowflake, where storage space matters less? What about those of us on Redshift?

Perhaps it would help to talk about how this is handled at different data scales?

Thanks for the information and cool tools.

sam

Hey @sam.

As Drew suggested in his Slack response, there’s definitely an argument that the storage costs are worth the benefit of each developer having their own isolated dev environment.

That said, particularly on Redshift, if storage is an issue (which it often can be), there are a few different things I’ve seen people do:

  1. Limit the records for really large tables when developing. This can be done with Jinja and would typically be added to base models for very large tables. It would look something like: {% if target.name == 'dev' %} where updated_at > current_date - 30 {% endif %} (there’s a fuller sketch after this list). Obviously, insert whatever logic you want in there. One downside is that this can lead to false negatives when testing in CI because you don’t have complete data: for example, a fan-out that would happen in production won’t occur in development because the offending records have been filtered out. One way to mitigate this is to mostly filter fact-type tables and leave dimension-like tables in full; the latter tend to be smaller tables anyway.

  2. Drop the dev/CI schemas overnight. Depending on how much development each developer is doing, you can clear up the storage overnight (or at some other regular interval) to ensure you aren’t ‘wasting’ storage on dev schemas that aren’t being used. Developers then have to rebuild any tables the next morning, which in most setups shouldn’t be a huge pain. That said, you do have to maintain a process to do this (see the sketch after this list).

  3. Try out another node type. I haven’t tried it out yet, but the new Redshift RA3 nodes would likely be helpful in solving your storage concerns. I believe there are some people in Slack who have tried them out and should be able to provide some info about their experiences migrating. (You could also try the ds2 nodes if they provide enough compute.)
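
To make the first suggestion concrete, here’s a minimal sketch of a base model with a dev-only date filter. The model, source, and column names (and the 30-day window) are all illustrative, so swap in whatever fits your project:

    -- models/staging/stg_events.sql (illustrative name)
    select *
    from {{ source('app', 'events') }}

    -- only scan recent history when building against the dev target
    {% if target.name == 'dev' %}
    where updated_at > current_date - 30
    {% endif %}

In production (or any target not named 'dev'), the where clause disappears entirely and the model builds against the full history.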
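
For the second suggestion, the cleanup itself can be as simple as a scheduled statement per dev schema. This is just a sketch that assumes dev schemas are named dbt_<developer> and that you have cron, an orchestrator, or similar to run it nightly (the schema names are made up):

    -- run nightly; cascade drops every table and view inside the schema
    drop schema if exists dbt_sam cascade;
    drop schema if exists dbt_dylan cascade;

You could also wrap this logic in a macro and call it with dbt run-operation if you’d rather keep it inside your dbt project.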


Thanks @dylanbaker - that’s very helpful.

I think @dylanbaker gave an excellent answer above!

The only other things I’d like to add here are about how this works on databases other than Redshift.

Both Snowflake and BigQuery use distributed storage systems:

  • Snowflake uses S3 (or the equivalent on other cloud providers), which can scale to store petabytes of data
  • BigQuery uses Colossus, which can scale to (from the linked article) “dozens of petabytes” of storage

Redshift, by comparison, does not separate data storage from computation. Adding a Redshift node increases your cluster’s processing capabilities as well as the available disk space on the cluster. On Redshift, you need to be really careful not to let disk usage get too high. Anecdotally, we’ve seen Redshift clusters start acting up when the disk is more than 70% full. This type of problem just isn’t possible to encounter on Snowflake or BigQuery.
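
If you want to keep an eye on that threshold, one option (just a sketch; there are fancier admin views out there) is to query Redshift’s stv_partitions system table, where used and capacity are reported in 1 MB blocks:

    -- rough percentage of total disk used across the cluster
    select sum(used)::float / sum(capacity) * 100 as pct_disk_used
    from stv_partitions;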

We do have some fledgling plans to optionally have dbt build only the models that have changed, deferring to production for anything that has not changed in the dev environment. Check out this issue (and feel free to add a comment) for more info: https://github.com/fishtown-analytics/dbt/issues/1612

The other big thing to mention in this space is that Snowflake has a notion of zero-copy clones. There’s a lot to say about them, but to start, check out the docs here: https://docs.snowflake.net/manuals/sql-reference/sql/create-clone.html
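
As a quick illustration (the schema names here are made up), you can clone an entire production schema for a developer, and Snowflake only stores new data as the clone diverges from the original:

    -- creates analytics_dev_sam instantly; no data is physically copied up front
    create schema analytics_dev_sam clone analytics;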
