We’ve been using dbt for quite a while now and loving it! However, as great as it is for working inside the data warehouse, there’s still a lot of work we need to do before the data gets into the warehouse and into dbt’s domain.
We’ve been benchmarking data orchestration tools and are considering either Dagster or Prefect. Both of them seem really great and hugely popular in this space, and both of them now support dbt as well.
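For reference, both integrations essentially wrap the dbt CLI, so the step we’d be orchestrating looks roughly like this (a minimal, tool-agnostic sketch; the run_dbt name is just illustrative):

```python
# Both Dagster (via dagster-dbt) and Prefect ship first-party dbt
# integrations, but underneath they invoke the dbt CLI, roughly like this:
import subprocess
from typing import Optional

def run_dbt(selector: Optional[str] = None) -> None:
    """Run dbt models, optionally restricted to a selector."""
    cmd = ["dbt", "run"]
    if selector:
        cmd += ["--models", selector]
    # check=True makes a non-zero dbt exit code fail the orchestrated step
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_dbt()
```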
My initial thoughts:
They both seem to cover the same core functionality and have great code usability, and they work very similarly. However, Dagster has a bit more versatility with integrations (the Jupyter/papermill support is appreciated; see the sketch below these points).
Dagster seems to have a better UI and better tools for debugging data pipelines locally. This is hugely beneficial as data pipelines grow more complex.
Prefect has better cloud operations and requires less maintenance thanks to the native Prefect Cloud service, which is appreciated. We’re happy to pay some premium for less maintenance work.
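On the Jupyter/papermill point, Dagster’s dagstermill package wraps a notebook as a pipeline step. A rough sketch (the notebook path is made up, and I’ve elided the IO setup for the executed-notebook output):

```python
# dagstermill turns a Jupyter notebook into a solid; papermill executes a
# parameterized copy of the notebook on every pipeline run.
import dagstermill as dm
from dagster import pipeline

# "notebooks/explore.ipynb" is a hypothetical path for illustration
explore = dm.define_dagstermill_solid(
    name="explore",
    notebook_path="notebooks/explore.ipynb",
)

@pipeline
def analysis_pipeline():
    explore()
```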
Does anybody have hands-on experience and could share some thoughts? Or any direct recommendations? Or should we consider something else entirely?
I was at the same crossroads just weeks ago, and your thoughts are spot on. It was a tough call, but we ultimately went with Dagster, mostly due to its superior tooling (Dagit), flexible programming model, and community. I honestly think you’ll make a good decision either way, but for us it just seemed like Dagster “thinks” holistically about the process and challenges of building data applications, whereas Prefect solves for developing and executing pipelines in a very ergonomic way, but it’s not as complete.
I saw this post from Nick Schrock in the Dagster Slack community that I think gets at the core of the difference between them:
Dagster pipelines are more structured and constrained. This allows us to have a lot of additional features (a type system, a config management system, a system-managed context object that flows through the compute, among other things). By contrast, Prefect pipelines are more minimal and dynamic.
Another way of framing this difference is that Dagster is very interested in what the computations are doing rather than only how they are doing it. We consider ourselves the application layer for data applications (rich metadata, a type system, structured events with semantic meaning, etc.), whereas Prefect frames their software in terms of “negative” and “positive” engineering. This negative/positive framing is more exclusively about the “how” of pipelines: retries, operational matters, etc.
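To make that concrete, here’s a rough hello-world side-by-side using the current APIs (the names and config values are just illustrative). Dagster hands every step a system-managed context carrying validated config; a Prefect task is essentially a plain decorated function:

```python
# Dagster: a solid declares a config schema, and the framework passes in a
# system-managed `context` carrying validated config, a logger, resources, etc.
from dagster import solid, pipeline, execute_pipeline

@solid(config_schema={"greeting": str})
def say_hello(context):
    context.log.info(context.solid_config["greeting"])

@pipeline
def hello_dagster():
    say_hello()

execute_pipeline(
    hello_dagster,
    run_config={"solids": {"say_hello": {"config": {"greeting": "hi from dagster"}}}},
)

# Prefect: a task is just a decorated function; inputs arrive as plain
# arguments rather than through a per-task config schema.
from prefect import task, Flow

@task
def greet(greeting):
    print(greeting)

with Flow("hello-prefect") as flow:
    greet("hi from prefect")

flow.run()
```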
It didn’t come up here, but Ploomber (https://github.com/ploomber/ploomber) is also a major tool in this space. It integrates seamlessly with Airflow, Kubeflow, Argo, etc., so you can focus on just the core coding part. It’s well integrated with Jupyter and papermill, so you can stay in the interactive environment. We recently added monitoring and alerting for pipelines and an easy way to deploy your experiments to the cloud.
On top of that, it offers a seamless transition to production, since the work you do interactively is analyzed and cleaned into .py files behind the scenes.
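For a feel of the API, here’s a minimal sketch using Ploomber’s Python interface (most examples use a pipeline.yaml spec instead; the file and task names here are just illustrative):

```python
# Minimal Ploomber DAG: each task declares the File it produces, and a
# downstream task receives its dependencies through the `upstream` argument.
import pandas as pd
from ploomber import DAG
from ploomber.tasks import PythonCallable
from ploomber.products import File

def load(product):
    pd.DataFrame({"x": range(10)}).to_csv(str(product), index=False)

def clean(upstream, product):
    df = pd.read_csv(str(upstream["load"]))
    df[df["x"] > 2].to_csv(str(product), index=False)

dag = DAG()
t_load = PythonCallable(load, File("raw.csv"), dag, name="load")
t_clean = PythonCallable(clean, File("clean.csv"), dag, name="clean")
t_load >> t_clean  # declare the dependency
dag.build()  # incremental: skips tasks whose inputs haven't changed
```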