DBT on pyspark cost and models

yzhang · November 29, 2022, 7:29pm

We have some data transforms running on AWS EMR with pyspark today, and more tables are coming. We are evaluating DBT because we like that it keeps the original SQL logic, and we have 3 basic questions:

1. Cost. Does DBT with pyspark cost more or less than running pyspark directly on EMR?
2. Is there a way to specify only a partial model in DBT?
   a. For example, if I have a table with 30 columns, do I have to write all 30 columns in the DBT config file? Or can I list just (say) 5 columns and cover the rest with a wildcard?
   b. Much of our SQL logic uses ‘select * from table’, and I don’t much care which columns those are. Can DBT run without a config file describing the table schema?
3. Since we already have some pyspark jobs running on AWS EMR, is there a best practice for converting those jobs to DBT with pyspark?

joellabes · December 1, 2022, 2:29am

Hi @yzhang, good questions!

I don’t know enough about the pricing of pyspark on EMR to be able to answer this question. If you’re using EMR then I assume you’re embedded in the AWS ecosystem and would be using Redshift, in which case your nodes have a fixed cost per month based on their size. If you use something like Snowflake on AWS, then you’re looking at a credit-based system.
- a) You don’t have to add any columns in your .yml files, unless you want to add documentation, quality tests, etc. Columns that aren’t explicitly documented will be inferred in the documentation site by inspecting the information_schema or similar when you run dbt docs generate, and will populate an artifact file called catalog.json.
- b) Yes (see above)
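
To make the point about inferred columns concrete, here is a minimal Python sketch that lists the columns recorded for each model in catalog.json after running dbt docs generate. It assumes the usual artifact layout (a top-level "nodes" mapping whose entries each carry a "columns" mapping with "name" and "index" fields) and the default target/catalog.json location; the helper names and the sample data are hypothetical, so check them against the artifact your dbt version actually writes.

```python
import json
from pathlib import Path


def columns_by_model(catalog: dict) -> dict:
    """Map each node's unique id to its column names, ordered by column index.

    Assumes the catalog layout described above: catalog["nodes"][id]["columns"]
    is a mapping of column key -> {"name": ..., "index": ..., ...}.
    """
    result = {}
    for unique_id, node in catalog.get("nodes", {}).items():
        columns = node.get("columns", {})
        ordered = sorted(columns.values(), key=lambda col: col.get("index", 0))
        result[unique_id] = [col["name"] for col in ordered]
    return result


def load_catalog(path: str = "target/catalog.json") -> dict:
    """Read a catalog artifact from disk (path is an assumed default)."""
    return json.loads(Path(path).read_text())


# Hypothetical, hand-written sample in the assumed shape, so the sketch runs
# without a dbt project. No columns were declared in any .yml file here.
sample_catalog = {
    "nodes": {
        "model.my_project.orders": {
            "columns": {
                "order_id": {"name": "order_id", "index": 1, "type": "bigint"},
                "amount": {"name": "amount", "index": 3, "type": "double"},
                "customer_id": {"name": "customer_id", "index": 2, "type": "bigint"},
            }
        }
    }
}

if __name__ == "__main__":
    for model, cols in columns_by_model(sample_catalog).items():
        print(model, cols)
```

With a real project you would call columns_by_model(load_catalog()) instead of using the sample dictionary.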
We have a guide on migrating from stored procedures which might be partially relevant. You should also consider using the audit-helper package, which will help you generate rollup queries to compare your transformations before and after. For more on this, see “How to not lose your mind when auditing data” and “How to not lose your mind when auditing data: Part II”.
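
As a rough illustration of what a before/after rollup comparison checks, the Python sketch below counts rows that appear in both outputs, only in the legacy output, or only in the new output. This is not the audit-helper API (that package generates SQL that runs in your warehouse); the function name and the sample rows are hypothetical, and the approach only makes sense for samples small enough to hold in memory.

```python
from collections import Counter


def compare_rows(before, after):
    """Summarize how two collections of rows (tuples) overlap.

    Duplicate rows are counted, so a row that appears twice in one output
    and once in the other contributes one match and one mismatch.
    """
    before_counts = Counter(before)
    after_counts = Counter(after)
    return {
        # Multiset intersection: rows present in both outputs.
        "in_both": sum((before_counts & after_counts).values()),
        # Multiset differences: rows present on one side only.
        "only_in_before": sum((before_counts - after_counts).values()),
        "only_in_after": sum((after_counts - before_counts).values()),
    }


# Hypothetical sample: output of the legacy pyspark job vs. the dbt model.
legacy_rows = [(1, "a"), (2, "b"), (3, "c")]
dbt_rows = [(1, "a"), (2, "B"), (3, "c")]

if __name__ == "__main__":
    print(compare_rows(legacy_rows, dbt_rows))
```

A summary with zero rows on either "only" side suggests the migrated model reproduces the legacy output for that sample; anything else points at rows worth inspecting.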
