dbt on AWS? What is the best approach?

Hello friends!

We’re thinking of using dbt with our company’s analytical platform, which uses AWS Glue and Presto (Trino now, I guess) and a few Redshift clusters here and there - it’s pretty fragmented.

We are now starting a few smaller analytical projects that would be a great fit for a PoC of this approach (or maybe I’m just unreasonably hyped about dbt :slight_smile: ) - but the question is: how?

Redshift would obviously work, but it sits in kind of a strange space: it is rather expensive for smaller initiatives (like a 20-table data mart running once a day), yet not performant enough to be the proper backbone of everything - at least that’s how it seems; I’d be happy to hear about everyone else’s experience!

Also, our core is S3 + Glue Catalog, and single-team Redshift clusters can easily become silos.

Another option would be to use Spark - but with EMR that’s tricky from an access control perspective (AWS access management isn’t exactly easy and/or granular), and getting something like Databricks onboarded is going to take months.

The option I’m looking at most is using dbt with Presto - but there are two problems:

  1. No snapshots - which is probably fine, since smaller projects tend to either not keep history or rely on the data lake for historical data, but we would need to tackle this eventually (see the sketch after this list).
  2. AWS Glue Catalog (basically a serverless Hive Metastore) doesn’t support renaming tables, so only view/ephemeral materializations work - and that won’t give dashboard users a good enough experience.
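To illustrate what I mean in point 1, here’s roughly the workaround I’d imagine: an append-only incremental model that stores a dated copy of the source on every run. This is just an untested sketch (the source and column names are made up), and it assumes dbt’s incremental materialization works on Presto + Glue, which seems plausible since it only inserts rather than renames:

```sql
-- models/customer_history.sql
-- Poor man's snapshot: append a dated copy of the source each run.
-- (Hypothetical source/columns; untested against dbt-presto.)
{{ config(materialized='incremental') }}

select
    c.customer_id,
    c.status,
    current_date as snapshot_date
from {{ source('lake', 'customers') }} as c

{% if is_incremental() %}
  -- only append if today's snapshot hasn't been loaded yet
  where current_date > (select max(snapshot_date) from {{ this }})
{% endif %}
```

It’s not a real snapshot (no valid_from/valid_to tracking), just a periodic full copy, but that matches how our smaller projects use history anyway.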

So I’m not sure how to tackle this - I guess we could write our own materialization to deal with the table rename (not exactly sure how, though - it’s not like Presto can do a transactional overwrite), but the optics of presenting a tool to our users while already making workarounds aren’t great.
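To be concrete, the workaround I’m picturing is a custom materialization that drops and recreates the table, instead of the create-then-rename swap the built-in table materialization relies on. This is only a sketch loosely following the shape of dbt’s materialization macros (untested, and the drop/create pair is not atomic - queries hitting the table mid-run would fail):

```jinja
{% materialization table_drop_create, adapter='presto' %}
  {#- Sketch only: drop + recreate instead of rename-swap,
      because Glue Catalog can't rename tables. NOT atomic:
      the table is briefly missing between the two statements. -#}

  {%- set target_relation = this.incorporate(type='table') -%}

  {{ run_hooks(pre_hooks) }}

  {% call statement('drop_existing') -%}
    drop table if exists {{ target_relation }}
  {%- endcall %}

  {% call statement('main') -%}
    create table {{ target_relation }} as
    {{ sql }}
  {%- endcall %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
```

Maybe usable for nightly jobs where a few seconds of missing table is acceptable, but it’s exactly the kind of workaround I’d rather not have to explain to users.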

So while I think dbt would greatly help move our analytics forward (compared to handwritten Spark code and Redshift stored procedures), I’m not sure what the best approach would be.

I’d be happy to hear any recommendations, comments, or anything else.

Thanks!

  1. Replace Redshift, Presto, Glue Catalog with Snowflake… Boom.

I mean, that would work, sure, but even if we got the higher-ups to agree to that right away (which is not going to happen), it would still take at least half a year to get Snowflake assessed and onboarded.

I was looking for some way to get the value faster.

Yeah, it sounds like dbt may not be the best choice for you right now. Its main value-add is as a transform tool within a single data warehouse. I’ve never really used AWS Glue or Presto first-hand, but from what I can tell they do seem to compete somewhat with how dbt is traditionally used (i.e. in a stack where you load all your data into one monolithic data warehouse).

I guess if you just want to give it a try, you could probably do as you suggested and use it in small workflows within the Redshift silos before that data gets picked up by AWS Glue (something like the sketch below). And then, if you like that workflow, maybe you can start building the case for refactoring your data mesh into a single-source data warehouse design.
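Something like this is what I have in mind: a small mart model materialized in Redshift, with a post-hook that unloads the result to S3 so your Glue-side consumers can pick it up. Treat it as a hedged sketch - all the names (model, bucket, IAM role) are invented, and I haven’t verified the UNLOAD options against a real cluster:

```sql
-- models/marts/orders_daily.sql
-- Hypothetical mart model inside the Redshift "silo"; the post-hook
-- pushes the result to S3 for the existing Glue/Presto consumers.
{{ config(
    materialized='table',
    post_hook="unload ('select * from {{ this }}')
               to 's3://my-analytics-bucket/marts/orders_daily/'
               iam_role 'arn:aws:iam::123456789012:role/my-unload-role'
               format as parquet
               allowoverwrite"
) }}

select
    order_date,
    count(*) as order_count,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by order_date
```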