A containerized dbt environment for your team

Getting a new member of your team set up with dbt can be a challenge. Did they install the same versions of Python and dbt as everyone else? Does it work on your machine but not theirs? It can also be difficult to ensure everyone on your team keeps their dbt environment up to date. This is a common challenge in most software development environments. Fortunately, there is a way to reduce a lot of this pain: run your dbt environment in a Docker container built from an explicit, tested recipe that sets up dbt with all the right dependencies.

I’ve found myself setting up containerized Docker environments several times. To make this easier in the future, I built a dbt container skeleton that can be used to bootstrap a manageable, secure, containerized dbt development environment. Once it’s initially configured, updating your environment is as simple as

inv build

and running dbt code is just

inv dbt-shell
$ dbt run

See the main dbt container skeleton repo for details.
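
For a flavor of how those commands are wired up, here is a minimal sketch of what the invoke tasks in a tasks.py might look like. The actual skeleton’s tasks are more complete; the image tag and volume mounts below are placeholders, not the repo’s real values.

# tasks.py -- minimal sketch of invoke tasks in the spirit of the skeleton.
# The image tag and mount paths are placeholders.
from invoke import task

IMAGE = "dbt-dev"  # placeholder image tag

@task
def build(ctx):
    """Rebuild the dbt development image from the project Dockerfile."""
    ctx.run(f"docker build -t {IMAGE} .", pty=True)

@task
def dbt_shell(ctx):
    """Open an interactive shell inside the dbt container.
    invoke exposes this on the command line as `inv dbt-shell`."""
    ctx.run(
        f"docker run --rm -it -v $(pwd):/project -w /project {IMAGE} bash",
        pty=True,
    )

From there, inv --list shows a new team member every task they need to know about.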


@gnilrets this is awesome. I’m glad to see others working in the same problem space. Our team is growing quickly, and dev environment setup is probably our largest impediment right now.

Follow up question:

Would you recommend this process for all new team members?

I ask because this solution seems to ensure a stable environment at the expense of initial setup effort and overall complexity. Is a stable environment what you see new users struggle with the most? Does this solution assume that new team members are already familiar with Anaconda environments?

Even teaching Anaconda to someone with only data viz and SQL experience is quite a heavy lift. It was validating to see @aescay’s post last week rationalizing virtualenv over conda envs:

Please don’t read this as critical; I’m just struggling a lot right now to find a happy path for helping new team members get set up. Perhaps we can set up a working group to figure this out!

Hi @data_ders, glad you liked this! No criticism taken (although accepted if needed).

Unless your team uses dbt-cloud 100%, I would recommend this process (or one like it) for all team members and projects.

I would also argue that this process is actually supposed to make initial setup easier, rather than being an expense. Before I started using containerized development environments (on non-dbt projects), I would spend half a day with a new team member just getting their local environment set up with the right version of Python, pip packages, homebrew recipes, etc. There were always a few things that had worked on my machine a few months earlier but no longer worked with more recent packages, and that only gets worse as the project’s complexity and dependencies grow. Later on, making any sort of upgrade to our environment was just as difficult, so we were rarely willing to go through the pain of upgrading our packages; we would miss out on nice new features or important bug fixes, and our code would just end up rotting.

I personally don’t think setting up miniconda is too big of a lift. After the initial setup, there’s only one command you have to know: conda activate myenv. But I can see how virtualenv might be slightly simpler, especially if you’re only going to be managing a single dbt project.


Do you currently walk new users through the setup described in the repo? Or do you ask them to do it independently? Do we assume that new users know how to open up a shell into a Docker container?

Also, dumb question – is the idea that the container is hosted on the same host as the database itself, to have sort of a remote development environment? Or is this just meant to build a layer of abstraction over end users’ local setup?

Again asking from the perspective of someone also struggling with helping new team members onboard. I wrote a thing about our pain points yesterday

Last thing –

We don’t have any homebrew recipes, so conda handles all of this for us. The big challenge I find is ensuring that the conda env activates automatically for users when they open the project in VSCode. Fortunately, the Python extension now auto-activates if there’s a requirements.txt file in the repo. The remaining challenge is helping new users set the python.pythonPath variable in their settings.json.
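
For illustration, that last step amounts to something like this hypothetical helper (not part of any repo mentioned here), which writes the setting into .vscode/settings.json; the interpreter path is only an example and would need to point at wherever the user’s conda env actually lives.

# set_python_path.py -- hypothetical helper that points VSCode's Python
# extension at a conda env by writing python.pythonPath into settings.json.
# The env path is an example; adjust it to the user's actual environment.
import json
from pathlib import Path

ENV_PYTHON = str(Path.home() / "miniconda3/envs/myenv/bin/python")  # example path

settings_file = Path(".vscode/settings.json")
settings_file.parent.mkdir(exist_ok=True)

settings = json.loads(settings_file.read_text()) if settings_file.exists() else {}
settings["python.pythonPath"] = ENV_PYTHON
settings_file.write_text(json.dumps(settings, indent=2) + "\n")
print(f"Set python.pythonPath in {settings_file}")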

Some new users are able to follow the directions in our repo README.md without any help, which is pretty similar to what is in the dbt-container-skeleton. Others who are less familiar with working on the command line may need some handholding at first. Most of the complexity is meant to be wrapped up into simple invoke tasks (via tasks.py).

This setup is meant for local development, so the container is built and run on the developer’s laptop. (Since the image is built on the developer’s machine, there is a risk that the build process could differ slightly from dev to dev. If that becomes an issue, it would be prudent to build a workflow that hosts a main image on Docker Hub and shares it with your team, but I haven’t found that to be necessary.)
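
If you did go the shared-image route, it could be as simple as adding a pull task next to build. This is a sketch under that assumption; “yourorg/dbt-dev” is a hypothetical Docker Hub repository, not something the skeleton actually publishes.

# Possible tasks.py addition for pulling a shared, pre-built image instead of
# building locally. "yourorg/dbt-dev" is a hypothetical registry image.
from invoke import task

SHARED_IMAGE = "yourorg/dbt-dev:latest"  # hypothetical published image

@task
def pull(ctx):
    """Pull the team's shared dbt image so everyone runs the same build."""
    ctx.run(f"docker pull {SHARED_IMAGE}", pty=True)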

@data_ders I think another important caveat to note with our current solution, which I failed to mention in the main post, is that our workflow centers mostly on data model development, and we rarely delve into custom Python workflows and scripts. For our team, a light environment handler did the trick because we rarely had Python (or other programming) dependencies floating around and in flux on our local machines. However, prior to working at Fishtown I was on a broader data team, and we had a lot of Python and R projects in flight simultaneously. In that sort of environment, which is common for data teams whose resources are pulled in all different directions, I would highly recommend using something like conda (it is more robust, and fully discrete Python environments are worth the effort), or even dockerizing environments (which gives you full flexibility and isolation, and is what we used in that organization). Hope this helps you figure out the best environment setup for your team!

Updates! The container skeleton now includes dtspec testing, SQLFluff linting, and a minimal GitHub Actions setup for CI!
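
If you’re wiring up linting in your own project, the idea is roughly this kind of invoke wrapper. This is a sketch in the same style as the other tasks, not the skeleton’s exact task; the models path is a placeholder.

# Sketch of invoke tasks wrapping SQLFluff, in the style of tasks.py.
# The target path ("models") is a placeholder for your project's SQL directory.
from invoke import task

@task
def lint(ctx):
    """Lint dbt SQL with SQLFluff."""
    ctx.run("sqlfluff lint models", pty=True)

@task
def lint_fix(ctx):
    """Apply SQLFluff's automatic fixes."""
    ctx.run("sqlfluff fix models", pty=True)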


Hi @gnilrets and followers of this post! @gnilrets, your description of the pain points data teams encounter is so spot on :tada:

Our team at Palmetto Energy has also thought a lot about this problem this year. I looked through your repo and found that your approach to containerizing dbt is very similar to the approach my team has taken!

I wanted to share our package, palm, which we just open sourced (my coworker Emily spoke about it at Coalesce last week, in a talk called “Data Paradox of the Growth Stage Startup”). Palm is a CLI for your CLIs: a standardized interface across your many projects, built on a common set of customizable commands. You can implement each command with custom logic for a particular project, or share the same logic across projects with similar workflows via plugins.

We wrote a plugin for palm called palm-dbt which has the standard dbt commands built out, along with the ability to generate all the files necessary to containerize your project with a simple command (palm containerize).

Our team also uses palm in 10+ other projects, including Airflow, Great Expectations, Terraform, and more. We don’t have plugins for these written yet because dbt is by far our biggest use-case and our other projects have company-specific logic, but we hope the community at large will help us build out these plugins going forward as well!

The containerize command is included with palm core, so it is available for use in other projects besides dbt. This has helped us tremendously with the Python version juggling issues that others have mentioned in this post. It has also let us forgo Python environment managers, virtualenvs, and the like: just containerize and list your project’s dependencies in its requirements file!

You have done great work providing this template for containerization to the community. We would love your contributions to palm if there is something we missed!

I just wanted to mention VSCode’s devcontainer feature, which would play very nicely with this.