Using dbt on AWS EMR

gabrielwry · January 10, 2023, 7:09pm

Hi, I am a data engineer who is relatively new to dbt, but our data science and analytics team has been using dbt for couple months. And the DE team recently is required to do an audit on the cost for the dbt pipelines.

Our architecture is quite standard:

dbt + spark via thrift server running on AWS EMR, data stored on S3 as parquet files registered in hive catalog.

So far, the main saving opportunity we found was to utilize spot instance rather than running all instances On-demand. But the problem is that from our observation, dbt pipeline rarely use the task instance but most of the work happened on core nodes (where the data node lives). But we don’t want to run core nodes with spot lifecycle as the loss of a core node may introduce data loss. And the run time of the pipeline seems to be in negative correlation with the number of core nodes.

Are there any suggestion on how to isolate the root cause of this behavior? Was it simply because the dbt pipeline we are running is mainly talking to HDFS? So the heavy lifting all happens on core nodes? Or thrift server itself doesn’t utilize task nodes (this might be a stupid question, cause I don’t fully understand how thrift server works, it sounds like a simple JDBC/ODBC, so it shouldn’t be different from a normal spark application right?)

Very much appreciate!

paulskit · July 26, 2024, 9:24am

You can use “CORE(exclusive=true)” label to make sure your jobs gets scheduled on task nodes only.
See this AWS article
https://aws.github.io/aws-emr-best-practices/docs/bestpractices/Features/Spot%20Usage/best_practices/#bp-424-reserve-core-nodes-for-only-application-masters-am

Topic		Replies	Views
DBT Fal or Python with Spark (EMR) Help	1	1663	August 19, 2023
Productionised scheduling of DBT Athena - AWS Help	2	1119	May 26, 2023
DBT Core With EMR Help	0	200	September 13, 2024
Population source tables on AWS stack Archive	0	1678	January 11, 2022
DBT cloud integration with glue job Help dbt-cloud	0	49	January 23, 2025

Using dbt on AWS EMR

Related topics