dbt on PySpark: cost and models

Hi @yzhang, good questions!

  1. I don’t know enough about the pricing of PySpark on EMR to answer this question. If you’re using EMR, I assume you’re embedded in the AWS ecosystem and would be using Redshift, in which case your nodes have a fixed cost per month based on their size. If you use something like Snowflake on AWS, then you’re looking at a credit-based system.
    • a) You don’t have to add any columns in your .yml files unless you want to add documentation, data quality tests, etc. Columns that aren’t explicitly documented will still show up in the documentation site: when you run dbt docs generate, dbt inspects the information_schema (or your adapter’s equivalent) and writes the results to an artifact file called catalog.json. There’s a minimal schema.yml sketch after this list.
    • b) Yes (see above)
  2. We have a guide on migrating from stored procedures which might be partially relevant. You should also consider the audit-helper package, which helps you generate rollup queries to compare your transformations before and after the migration (a sketch is below, after this list). For more on this, see How to not lose your mind when auditing data and How to not lose your mind when auditing data: Part II.
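
To illustrate 1a/1b, here’s a minimal sketch of a model .yml file. The model name stg_orders and the column order_id are hypothetical; any columns you leave out of the file will still be discovered from the warehouse and included in catalog.json when you run dbt docs generate.

```yaml
# models/staging/schema.yml -- hypothetical model and column names
version: 2

models:
  - name: stg_orders
    description: "Staged orders, one row per order"
    columns:
      - name: order_id
        description: "Primary key for an order"
        tests:
          - unique
          - not_null
      # Other columns (order_date, status, ...) can be omitted here;
      # dbt docs generate will still pick them up from the warehouse
      # and include them in target/catalog.json.
```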
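
For point 2, here’s a rough sketch of how audit-helper can be used from an analysis file, assuming you’ve added the audit_helper package to packages.yml, that your legacy stored procedure writes to old_schema.fct_orders, and that the new dbt model is fct_orders with primary key order_id (all of these names are placeholders):

```sql
-- analyses/compare_fct_orders.sql -- hypothetical schema, table, and column names
{% set old_etl_relation = adapter.get_relation(
      database   = target.database,
      schema     = "old_schema",
      identifier = "fct_orders"
) %}

{% set dbt_relation = ref("fct_orders") %}

-- compare_relations builds a query summarising how many rows match
-- exactly between the legacy table and the dbt model
{{ audit_helper.compare_relations(
      a_relation  = old_etl_relation,
      b_relation  = dbt_relation,
      primary_key = "order_id"
) }}
```

Run dbt compile and execute the compiled SQL in your warehouse to see what percentage of rows line up before you cut over.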