dbt on PySpark: cost and models

Hi @yzhang, good questions!

  1. I don’t know enough about the pricing of PySpark on EMR to answer this question. If you’re using EMR, I assume you’re embedded in the AWS ecosystem and would be using Redshift, in which case your nodes have a fixed cost per month based on their size. If you use something like Snowflake on AWS, then you’re looking at a credit-based system.
    • a) You don’t have to add any columns in your .yml files unless you want to add documentation, data quality tests, etc. Columns that aren’t explicitly documented will still show up in the documentation site: when you run dbt docs generate, dbt inspects the information_schema (or your adapter’s equivalent) and writes the results to an artifact file called catalog.json. There’s a minimal schema.yml sketch after this list.
    • b) Yes (see above)
  2. We have a guide on migrating from stored procedures which might be partially relevant. You should also consider the audit-helper package, which helps you generate rollup queries to compare your transformations before and after the migration (a sketch is below, after this list). For more on this, see How to not lose your mind when auditing data and How to not lose your mind when auditing data: Part II.
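
To illustrate 1a/1b, here’s a minimal sketch of a model .yml file. The model name stg_orders and the column order_id are hypothetical; any columns you leave out of the file will still be discovered from the warehouse and included in catalog.json when you run dbt docs generate.

```yaml
# models/staging/schema.yml -- hypothetical model and column names
version: 2

models:
  - name: stg_orders
    description: "Staged orders, one row per order"
    columns:
      - name: order_id
        description: "Primary key for an order"
        tests:
          - unique
          - not_null
      # Other columns (order_date, status, ...) can be omitted here;
      # dbt docs generate will still pick them up from the warehouse
      # and include them in target/catalog.json.
```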
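
For point 2, here’s a rough sketch of how audit-helper can be used from an analysis file, assuming you’ve added the audit_helper package to packages.yml, that your legacy stored procedure writes to old_schema.fct_orders, and that the new dbt model is fct_orders with primary key order_id (all of these names are placeholders):

```sql
-- analyses/compare_fct_orders.sql -- hypothetical schema, table, and column names
{% set old_etl_relation = adapter.get_relation(
      database   = target.database,
      schema     = "old_schema",
      identifier = "fct_orders"
) %}

{% set dbt_relation = ref("fct_orders") %}

-- compare_relations builds a query summarising how many rows match
-- exactly between the legacy table and the dbt model
{{ audit_helper.compare_relations(
      a_relation  = old_etl_relation,
      b_relation  = dbt_relation,
      primary_key = "order_id"
) }}
```

Run dbt compile and execute the compiled SQL in your warehouse to see what percentage of rows line up before you cut over.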