Hey there!
I am a relatively new dbt user who is nearing the stage of deploying to production, and my team is going to use CI/CD on GitLab. Are there any best practices or words of wisdom we should be aware of as we start setting this up? I'll be working with a dev who is much better at CI/CD but has almost no exposure to dbt architecture/setup.
Thanks in advance!
dbt Cloud is a great option, depending on the size of your team and your data engineering maturity.
At GitLab, we run dbt in production via Airflow. Our DAGs are defined in this part of our repo. We run Airflow on Kubernetes in GCP. Our Docker images are stored in this project.
For CI, we use GitLab CI. In merge requests, our jobs are set to run in a separate Snowflake database (a clone). Here are all the job definitions for dbt. The rest of the CI pipeline is defined here.
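To make the shape of that concrete, here is a minimal sketch of what an MR-only dbt job in `.gitlab-ci.yml` can look like. It is not GitLab's actual job definition; the image, the `CI_CLONE_*` database naming, and the variable names are assumptions for illustration.

```yaml
# .gitlab-ci.yml (sketch only, not GitLab's real pipeline)
dbt-run-mr:
  stage: test
  image: python:3.10          # or a prebuilt image with dbt-snowflake installed
  variables:
    # Point dbt at a per-MR clone database instead of production.
    SNOWFLAKE_DATABASE: "CI_CLONE_${CI_MERGE_REQUEST_IID}"
  script:
    - pip install dbt-snowflake
    - dbt deps
    - dbt run --target ci
    - dbt test --target ci
  rules:
    # Only run this job in merge request pipelines.
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```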
The general principle, I think, is that you want your MRs to run dbt against real data but write to either a dev schema or a separate database clone, like we do. If you make dbt reference environment variables for where to write, you can control it quite nicely that way. (See our profile here for details on that.)
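Roughly, the env-var-driven profile looks like the sketch below. The variable names are my own placeholders rather than GitLab's; the point is that CI and production export different values, so the same profile writes to a clone in MRs and to the real database in prod.

```yaml
# profiles.yml (sketch) — env var names are illustrative assumptions
my_project:
  target: default
  outputs:
    default:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"
      database: "{{ env_var('SNOWFLAKE_DATABASE') }}"   # clone DB in CI, prod DB in Airflow
      schema: "{{ env_var('SNOWFLAKE_SCHEMA', 'analytics') }}"
      threads: 8
```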
Hope this is useful!
@tmurphy Thank you for sharing the knowledge, that's super helpful. I was wondering how to pass Airflow's macros, especially `{{ ds }}` and `{{ execution_date }}`, into dbt. A possible solution I was considering was using environment variables, so I was encouraged to see the repository. Many thanks!