We have some data transforms running on AWS EMR with PySpark today, and more tables are coming. We are evaluating dbt because we like that it keeps the original SQL logic, and we have three basic questions:
1. Cost: would dbt with PySpark cost more or less than our current PySpark jobs on EMR?
2. Is there a way to specify only a partial model schema in dbt?
   a. For example, if I have a table with 30 columns, do I have to list all 30 columns in the dbt config file, or can I list just a few (say, 5) and cover the rest with a wildcard?
   b. Many tables in our SQL logic use `select * from table`, and I don't care much which columns they contain. Is it possible for dbt to run without a config file describing the table schema?
3. Since we already have some jobs running in PySpark on AWS EMR, is there a best practice for converting those jobs to dbt with PySpark?
1. I don't know enough about the pricing of PySpark on EMR to answer this question. If you're using EMR, I assume you're embedded in the AWS ecosystem and would be using Redshift, in which case your nodes have a fixed cost per month based on their size. If you use something like Snowflake on AWS, then you're looking at a credit-based system.
2a) You don't have to add any columns to your .yml files unless you want to add documentation, data tests, etc. Columns that aren't explicitly documented are still picked up for the documentation site: when you run `dbt docs generate`, dbt inspects the information_schema (or similar) and writes the full column list to an artifact file called catalog.json.
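As a rough sketch of what that looks like (the file path, model name, and column names below are made up), the model's .sql file can keep its `select *` logic, and the properties file only lists the columns you care about:

```yaml
# models/staging/schema.yml -- hypothetical path, model, and column names
version: 2

models:
  - name: stg_orders        # the model's .sql file could simply be `select * from ...`
    description: "Orders staging model"
    columns:                # list only the columns you want documented or tested
      - name: order_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "Foreign key to the customers table"
      - name: order_date
    # the remaining ~25 columns need no entry here; `dbt docs generate` will
    # still pick them up from the warehouse metadata and write them to catalog.json
```

dbt only tests and documents what's listed here; everything else shows up in the generated catalog automatically.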