Hi, I'm a data engineer who is relatively new to dbt. Our data science and analytics team has been using dbt for a couple of months, and the DE team was recently asked to audit the cost of the dbt pipelines.
Our architecture is quite standard:
dbt + Spark via the Thrift Server, running on AWS EMR, with data stored on S3 as Parquet files registered in the Hive catalog.
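For reference, our connection profile looks roughly like this (a sketch; the host and schema are placeholders, and the port is the Spark Thrift Server default, so yours may differ):

```yaml
# profiles.yml (sketch -- host/schema are placeholders)
emr_spark:
  target: prod
  outputs:
    prod:
      type: spark
      method: thrift
      host: <emr-master-dns>   # EMR master node running the Thrift Server
      port: 10000              # Spark Thrift Server default; adjust to your cluster
      schema: analytics
      threads: 4
      connect_retries: 5
      connect_timeout: 60
```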
So far, the main savings opportunity we've found is to use Spot Instances rather than running everything On-Demand. The problem is that, from what we've observed, the dbt pipeline rarely uses the task instances; most of the work happens on the core nodes (where the HDFS DataNodes live). We don't want to run core nodes with the Spot lifecycle, since losing a core node could cause data loss. And the pipeline's runtime appears to be inversely correlated with the number of core nodes.
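One theory we want to test is whether the Thrift Server's executors simply got allocated on the core nodes at startup and stayed there. I've read that EMR pins YARN application masters to core nodes via node labels, but executors should still be able to land on task nodes. I believe EMR enables Spark dynamic allocation by default, but we're planning to set it explicitly and verify, roughly like this in the cluster configuration (the executor counts are made-up illustrations, not recommendations):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.shuffle.service.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "2",
      "spark.dynamicAllocation.maxExecutors": "40"
    }
  }
]
```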
Does anyone have suggestions on how to isolate the root cause of this behavior? Is it simply that the dbt pipeline is mainly talking to HDFS, so all the heavy lifting happens on the core nodes? Or does the Thrift Server itself not utilize task nodes? (This might be a stupid question; I don't fully understand how the Thrift Server works. It sounds like a simple JDBC/ODBC endpoint, so it shouldn't behave any differently from a normal Spark application, right?)
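For what it's worth, here's the kind of check I was planning to run to see where YARN containers actually land; a rough sketch, assuming the default ResourceManager web port (8088), with the master hostname as a placeholder:

```python
# Quick check of where YARN containers are actually running.
# <emr-master-dns> is a placeholder -- point it at your master node.
import requests

RM_URL = "http://<emr-master-dns>:8088"  # default YARN ResourceManager web port

resp = requests.get(f"{RM_URL}/ws/v1/cluster/nodes", timeout=10)
resp.raise_for_status()

for node in resp.json()["nodes"]["node"]:
    # Nonzero usedVirtualCores means containers (i.e. Thrift Server
    # executors) are running on that node; on EMR, core nodes should
    # carry the CORE node label when labels are enabled.
    print(
        node["nodeHostName"],
        "labels:", node.get("nodeLabels", []),
        "usedVCores:", node.get("usedVirtualCores", 0),
        "usedMemMB:", node.get("usedMemoryMB", 0),
    )
```

The Executors tab in the Spark UI for the Thrift Server application should show the same picture, if that's an easier place to look.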
Any help is very much appreciated!