Does anyone have any design patterns for building sampled development environments in dbt? This is so analysts can test things out quickly against production, without being slowed down by the large size of the dataset in prod. I’m currently just doing something like the following (sample a random 10%, up to a variable amount), but it looks a little messy:
{% if target.name == "dev" %}
AND MOD(CAST(RAND()*10 AS INT64),10)=0
LIMIT {{ var("max_results_in_dev") }}
{% endif %}
In general, I’d love to see any guidance or wisdom people have around the intricacies of introducing dbt into teams / production settings, et al.
2 Likes
I’ve seen companies effectively deploy your strategy @sjwhitworth, but it can require a lot of work. Overall here are my suggestions (feel free to ignore if you find another way that works for you!):
- keep all dimension data. don’t leave out certain customers, etc. this is onerous both to write and maintain: it requires a lot of sprawl-y code to make things consistent everywhere. it’s also hard to test.
- limit large fact tables to a configurable # of days prior to
current_timestamp()
. configure via var()
.
- apply all filtering within base models, and probably materialize these base models as tables so that you’re not executing the where clause over and over again. this depends on what your DAG looks like though.
- try to implement filtering on top of efficient keys depending on your data store. partitions in bigquery, sorts in redshift, etc.
Our just-released best practices guide shares code on how to implement #2, above.
Hope that’s helpful!
1 Like