Running dbt in Kubernetes



Hey there! We’ve been working with dbt and using Sinter for quite a while now. We felt that we needed a bit more control and decided to try running and scheduling dbt in one of our Kubernetes clusters. In this post I’d love to share how we did it.

Running dbt on Kubernetes

The main objective is to replace Sinter scheduled runs with a custom environment based on Kubernetes CronJobs. This gives us some flexibility, but it also takes some time to set up and maintain :wrench: .

1. Creating the Docker Image

Let’s assume a dbt project with the default folder structure is in place. Running that project with dbt using Docker is as simple as writing a Dockerfile similar to this one, then building (docker build -t your-dbt-image-name:tag .) and running it.

FROM davidgasquez/dbt:0.10.0

COPY your-dbt-folder /dbt

CMD ["dbt", "run", "--profiles-dir", "profile"]

The image it’s based on is dbt-docker. You can see a complete example in the kubedbt repository.

The CMD step uses a flag to specify the location of the profiles file. This is really helpful when running inside Docker, as you can pass the secrets through the environment! That means you’ll need to set up a few environment variables in a .env file.
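As a sketch, a profiles.yml that reads credentials from the environment could look like the following. The variable names and connection details here are illustrative; dbt’s built-in env_var function does the lookup at runtime.

```yaml
# profile/profiles.yml -- keys and variable names are illustrative
default:
  target: prod
  outputs:
    prod:
      type: redshift
      host: "{{ env_var('REDSHIFT_HOST') }}"
      user: "{{ env_var('REDSHIFT_USER') }}"
      pass: "{{ env_var('REDSHIFT_PASSWORD') }}"
      port: 5439
      dbname: "{{ env_var('REDSHIFT_DB_NAME') }}"
      schema: analytics
```

The matching .env file would then define REDSHIFT_HOST, REDSHIFT_USER, and so on, one KEY=value per line.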

If everything is set up properly you should be able to run docker run --env-file .env -it --rm your-dbt-image-name:tag locally.

2. Setting up the Cronjob

The next step is telling Kubernetes to run that command periodically. To do that, simply create a new CronJob resource YAML file (cronjob.yaml) pointing to the Docker image you created and pushed to a registry.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: your-dbt-project
spec:
  schedule: "0 16 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: your-dbt-project
            image: username/your-dbt-image-name:tag
            env:
              - name: REDSHIFT_DB_NAME
                valueFrom:
                  secretKeyRef:
                    name: redshift
                    key: database
              ...
              ...
            imagePullPolicy: Always
          restartPolicy: OnFailure

If you don’t have the secrets created in the cluster yet, you’ll also need to create the proper Secret resource.
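As a sketch, the Secret the CronJob above references (name redshift, key database) could be created from a manifest like this one, or equivalently with kubectl create secret generic. The values are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: redshift
type: Opaque
stringData:
  database: your-database-name
  # ...one entry per credential the CronJob maps into the environment
```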

Running kubectl apply -f cronjob.yaml will create a CronJob in Kubernetes. That CronJob will take care of running dbt once a day (exactly at 16:00).

There are a couple of interesting options to keep in mind:

  • restartPolicy: This will make Kubernetes restart the job (that runs all the models) in case of unhandled errors. You can read more about dbt exit codes in the official docs.
  • imagePullPolicy: With this setting, Kubernetes will always pull the image from the container registry and ignore the cached one. This makes deploying as easy as pushing the Docker image to the registry!

3. Adding CI/CD

This step is optional but recommended unless you want to push the image manually each time. There are numerous CI/CD vendors, so I’ll skip the implementation details. These are the things it should take care of after each change to the repository’s master branch:

  1. Build a new Docker image with the same name and tag
  2. Upload the image to the registry (overwriting the existing one)
  3. (Optional) Test the models
  4. (Optional) Update the CronJob resource in Kubernetes. This is helpful when changing the schedule or base image
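As a vendor-neutral sketch, the steps above boil down to a few commands your pipeline would run on every push to master. Image names, tags, and file paths are placeholders.

```shell
# 1. Build a fresh image under the same name and tag.
docker build -t username/your-dbt-image-name:tag .

# 2. Push it to the registry, overwriting the previous one.
docker push username/your-dbt-image-name:tag

# 3. (Optional) Test the models.
docker run --env-file .env --rm username/your-dbt-image-name:tag \
  dbt test --profiles-dir profile

# 4. (Optional) Apply CronJob changes (schedule, base image, ...).
kubectl apply -f cronjob.yaml
```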

With the previous setup, at any point in time you’ll only have one Docker image per dbt project. Rolling back is as simple as reverting the code changes, which triggers the continuous integration pipeline again.

4. Monitoring and Alerting

As with CI/CD, there are a vast number of vendors, and each one might need a custom implementation. That said, there’s one thing almost all of them will need: structured logging.

Calling dbt from the command line prints out some custom messages. They are really handy for the human eye but not that useful for machines to parse. Luckily, dbt has an internal API we can use. Once we’re able to write our own logs, we can hook into any monitoring or alerting vendor.

This is an example showing how a simple script (run.py) could look if you want to write custom logs using the internal dbt API:

import logging

import dbt.main

# Configure the standard logging module for our own messages.
logging.basicConfig(level=logging.INFO)

# Silence dbt's default console output so only our logs show up.
logging.getLogger("dbt").setLevel(logging.FATAL)

logging.info("DBT Run Started")

# handle_and_check runs dbt and returns the results plus a success flag.
results, success = dbt.main.handle_and_check(["run", "--profiles-dir", "profile"])

for res in results:
    name = res.node.get("alias")
    if res.errored:
        logging.error(f"Model {name} errored! Error: {res.error}")
    else:
        time = int(res.execution_time)
        logging.info(f"Model {name} completed in {time} seconds")

logging.info("DBT Run Completed!")

After that, you’ll need to change the Dockerfile’s CMD to ["python", "run.py"] instead of just dbt run.
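If your monitoring vendor expects one JSON object per log line, the script above can be extended with a custom formatter from the standard library. This is a minimal sketch; the JSONFormatter class, the field names, and the "dbt-runner" logger name are illustrative, not part of dbt itself.

```python
import io
import json
import logging


class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# Attach the formatter to a stream handler (stdout in the
# container; an in-memory buffer here for demonstration).
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JSONFormatter())

log = logging.getLogger("dbt-runner")
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("Model %s completed in %s seconds", "my_model", 12)

# Each line written to the stream is now valid JSON.
record = json.loads(stream.getvalue())
```

Pointing the log calls in run.py at a logger configured this way is all it takes to make the output machine-parseable.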


That’s all! If you have any feedback or questions, please feel free to reach out. :smile: