Airflow Integration

Hey all, I have a dbt project that I’d like to integrate with Airflow and was wondering what the current best practices are.

For context, my dbt project lives in GitLab repo A, and Airflow (Cloud Composer hosted on GCP) is a shared instance living in GitLab repo B.

My current dbt setup has full CI/CD, with merges to master deploying models to our production environment. Ideally I’d like to use Airflow to run tests or build incremental models on varying schedules, e.g. hourly, daily, etc.

I’ve seen a few different patterns for this and was wondering which one is recommended. Potential solutions:

  • Create a Docker image of the dbt repo on merge to master. Use Airflow’s KubernetesPodOperator with this image and run commands on the pod (see the sketch after this list)
  • Import my dbt repo as a submodule in the Airflow repo and use the BashOperator
  • Deploy my dbt files to a GCS bucket on merge to master, pull those files in from my Airflow DAGs and use the BashOperator
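
For the first option, this is roughly what I have in mind. It’s only a sketch, not working code: the image name, namespace, and the tag:hourly selector are placeholders, and the KubernetesPodOperator import path varies with your Airflow / cncf.kubernetes provider version.

```python
# Rough sketch of option 1: CI pushes a dbt image on merge to master,
# Airflow runs dbt commands in a pod from that image.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="dbt_hourly",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    dbt_run = KubernetesPodOperator(
        task_id="dbt_run_hourly_models",
        name="dbt-run-hourly-models",
        namespace="airflow",                    # placeholder namespace
        image="gcr.io/my-project/dbt:latest",   # placeholder image built by CI
        cmds=["dbt"],
        arguments=["run", "--models", "tag:hourly"],  # placeholder selector
        get_logs=True,
    )
```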

Hey @azhard, at better.com we’re doing Docker images of our dbt repo plus the KubernetesPodOperator. Pre-Airflow we were already deploying most of our services with Kubernetes and Docker, so this approach was the simplest for us to get started with.

Some other notes:
• We’re using Astronomer Airflow Enterprise (https://www.astronomer.io/)
• Our general DAG design for dbt jobs is one DAG per data source. Each data source DAG roughly looks like source_freshness_test >> source_test >> run_source+_in_staging_schema >> run_tests >> run_source+_in_custom_schemas (a rough sketch of one of these is below)
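
To make that shape concrete, here’s a rough sketch of one of those per-source DAGs, not our exact code: the image, namespace, “salesforce” source name, selectors, and targets are all placeholders, and you’d adjust the import path and commands for your Airflow and dbt versions (older dbt uses `source snapshot-freshness`, newer dbt uses `source freshness`).

```python
# Sketch of a per-data-source DAG: freshness check -> source tests ->
# run downstream models in staging -> test -> run in the final schemas.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

DBT_IMAGE = "registry.example.com/dbt:latest"  # placeholder image


def dbt_task(task_id, arguments):
    # Wrap a single dbt command in a pod running the dbt image.
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id.replace("_", "-"),
        namespace="airflow",  # placeholder namespace
        image=DBT_IMAGE,
        cmds=["dbt"],
        arguments=arguments,
        get_logs=True,
    )


with DAG(
    dag_id="dbt_salesforce",  # one DAG per data source
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    source_freshness_test = dbt_task(
        "source_freshness_test",
        ["source", "snapshot-freshness", "--select", "salesforce"],
    )
    source_test = dbt_task(
        "source_test",
        ["test", "--models", "source:salesforce"],
    )
    run_in_staging = dbt_task(
        "run_source_in_staging_schema",
        ["run", "--models", "source:salesforce+", "--target", "staging"],
    )
    run_tests = dbt_task(
        "run_tests",
        ["test", "--models", "source:salesforce+"],
    )
    run_in_custom = dbt_task(
        "run_source_in_custom_schemas",
        ["run", "--models", "source:salesforce+", "--target", "prod"],
    )

    source_freshness_test >> source_test >> run_in_staging >> run_tests >> run_in_custom
```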

Curious to hear how other folks are scheduling dbt as well. For instance, in parallel we’re exploring Prefect as our scheduler.