We’re thinking of using dbt with our company’s analytical platform - which uses AWS Glue and Presto (Trino now, I guess) and a few Redshifts here and there - it’s pretty fragmented.
We are now starting a few smaller analytical projects that would be a great fit to PoC this approach (or maybe I’m just unreasonably hyped about dbt) but the question is: how?
Redshift would obviously work, but it sits in a strange space: it’s rather expensive for smaller initiatives (like a 20-table data mart running once a day), yet not performant enough to be a proper backbone for everything - at least that’s how it seems; I’d be happy to hear about everyone else’s experience!
Also, our core is S3 + Glue Catalog, and single-team Redshift clusters can easily become silos.
Another option would be to use Spark - but with EMR that’s tricky from an access control perspective (AWS access management isn’t exactly easy or granular), and getting something like Databricks onboarded is going to take months.
The option I’m looking at most is using dbt with Presto - but there are two problems:
- No snapshots - which is probably fine, since smaller projects tend either not to keep history or to rely on the Data Lake for historical data, but we would need to tackle this eventually.
- AWS Glue Catalog (essentially a serverless Hive Metastore) doesn’t support renaming tables. So only view/ephemeral materializations work, and that won’t give dashboard users a good enough experience.
So I’m not sure how to tackle this - I guess we could write our own materialization to deal with the table rename (not exactly sure how, though - it’s not like Presto can do a transactional overwrite), but the optics of presenting a tool to our users while already resorting to workarounds aren’t great.
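For what it’s worth, the workaround I’ve been sketching (all table/view names here are hypothetical) would avoid renames entirely: each run writes to a fresh per-run table, then repoints a stable view at it. `CREATE OR REPLACE VIEW` is a single metastore operation in Presto/Trino, so dashboard users querying the view shouldn’t see a half-built table, even though the whole sequence isn’t transactional:

```sql
-- 1. Build this run's output under a suffixed name (new S3 location each run)
CREATE TABLE analytics.orders_mart__run_b AS
SELECT ...;  -- the model's SELECT

-- 2. Atomically repoint the stable name dashboards use
CREATE OR REPLACE VIEW analytics.orders_mart AS
SELECT * FROM analytics.orders_mart__run_b;

-- 3. Clean up the previous run's table (safe once the view no longer references it)
DROP TABLE IF EXISTS analytics.orders_mart__run_a;
```

The downsides are the ones you’d expect: orphaned run tables if step 3 ever fails, and users see a view rather than a physical table - but it sidesteps both the missing rename and the lack of transactional overwrite.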
So while I think dbt would greatly help move our analytics forward (compared to handwritten Spark code and Redshift stored procedures), I’m not sure what the best approach would be.
I’d be happy for any recommendations, comments, or anything else.