Publishing dbt Docs from a Docker Container

First, allow me to acknowledge that this is overkill for most teams. I want my dbt docs to be accessible to everyone in my business, but not to everyone else. I could do this with Sinter, of course, but 300 employees at $20 per seat per month is a bit too rich for my blood.

@peter_hanssens wrote a great piece here about how to use Netlify to serve your docs. But password-protecting that content runs you $9 per user per month. Less spendy than the native Sinter option, but still more than I want to spend.

My team already uses Docker for a ton of stuff, and we have infrastructure in AWS that runs Docker containers, so I decided to go that route.

Dockerfile

FROM python:3.6

ARG user=someUser
ARG organization=yourGitHubOrg
ARG repo=yourDBTRepo

ARG homedir=/home/${user}
COPY entrypoint.sh ${homedir}/

# Non-root group & user creation
RUN groupadd -r ${user}
RUN useradd -r -m -g ${user} ${user}
RUN mkdir ${homedir}/.ssh

# Git
ENV REMOTE_REPO git@github.com:${organization}/${repo}.git
ENV REPO_DIR ${homedir}/${repo}
RUN apt-get update && apt-get install -y git
COPY id_rsa ${homedir}/.ssh/
RUN ssh-keyscan github.com >> ${homedir}/.ssh/known_hosts

# BigQuery
ENV GOOGLE_APPLICATION_CREDENTIALS ${homedir}/service_account.json
COPY service_account.json ${homedir}/

# Permissions!
RUN chmod 0700 ${homedir}/.ssh
RUN chmod 0600 ${homedir}/.ssh/id_rsa
RUN chmod 0644 ${homedir}/.ssh/known_hosts
RUN chmod 0755 ${homedir}/entrypoint.sh
# the non-root user needs to own its home directory so it can clone the repo into it
RUN chown -R ${user}:${user} ${homedir}

# DBT!
RUN pip install dbt==0.11.1

# Prep for container execution
USER ${user}
WORKDIR ${homedir}
ENTRYPOINT ["/bin/bash", "entrypoint.sh"]

Note: You may need to tweak this to enable a connection to your data warehouse as specified in your profiles.yml file.
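
For reference only, and with all names illustrative (your profile name, GCP project, and dataset will differ), a minimal BigQuery profiles.yml that uses the baked-in service account key might look something like this:

# illustrative profiles.yml; the keyfile path matches ${homedir} from the Dockerfile above
my_dbt_project:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      keyfile: /home/someUser/service_account.json
      project: your-gcp-project
      dataset: analytics
      threads: 4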

entrypoint.sh

#!/usr/bin/env bash
git clone $REMOTE_REPO
cd $REPO_DIR
dbt deps --profiles-dir .
dbt docs generate --target prod --profiles-dir .
dbt docs serve --profiles-dir . > /dev/null 2>&1 &
while true
do
    sleep 600
    # fetch first so origin/master actually reflects what is on GitHub
    git fetch --all
    if [ "$(git rev-parse --short HEAD)" != "$(git rev-parse --short origin/master)" ]; then
        git reset --hard origin/master
        dbt deps --profiles-dir .
        dbt docs generate --target prod --profiles-dir .
    fi
done

Note: I have my profiles.yml file in the root of my repo.

This will clone your dbt repo, install dependencies, generate docs, and start the webserver in the background, then enter a loop where every 10 minutes it fetches from GitHub, checks whether origin/master has moved, and if so resets to the new code and regenerates the docs files.

Build the image from a location where your authentication files are present, so they can be copied into your image. I have a little EC2 instance for just this purpose, so I don’t have service account credentials proliferating onto a bunch of engineer laptops. From the desired folder, I run docker build --pull -t dbtdocs .
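
For concreteness, assuming the file names used in the Dockerfile above, the build folder on that instance only needs four files sitting next to each other:

# contents of the build folder (names match the COPY lines above)
#   Dockerfile  entrypoint.sh  id_rsa  service_account.json
docker build --pull -t dbtdocs .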

I can deploy that image to ECR if I want, but I run it from the image build machine (it's idle most of the time anyway; might as well give it something else to do!). To start serving the site, I run docker run -d -p 8080:8080 --restart unless-stopped dbtdocs
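
If you did want to push the image to ECR rather than run it on the build box, the usual push flow applies; this is only a sketch, and the account ID, region, and repository name are placeholders:

aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag dbtdocs 123456789012.dkr.ecr.us-east-1.amazonaws.com/dbtdocs:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/dbtdocs:latest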

Assuming your network settings allow it, now you can hit port 8080 on that machine and your docs site should be visible.

I went a step further, however, to SSL-ify the connection. I use an AWS HTTPS load balancer (because it makes it easy to deal with certificates and whatnot) listening on port 443, as expected, which routes all traffic to the docker host's 8080, where the docker host then routes it to the docker container's 8080, where dbt docs serve is serving the content. My security group settings in AWS disallow everything except 443 from outside (and allow 443 only from office IP addresses), but allow 8080 on the internal network.
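
As a rough sketch of the security group side (the group IDs and CIDR ranges below are placeholders; your office range and VPC range will differ):

# allow HTTPS to the load balancer from office IPs only
aws ec2 authorize-security-group-ingress --group-id sg-aaaaaaaa \
  --protocol tcp --port 443 --cidr 203.0.113.0/24

# allow 8080 to the docker host from inside the VPC, so the load balancer can reach it
aws ec2 authorize-security-group-ingress --group-id sg-bbbbbbbb \
  --protocol tcp --port 8080 --cidr 10.0.0.0/16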

Some known weaknesses:

  • I have to maintain the hardware for this.
  • The first time through the load balancer and security group config was a bit hairy. I’ve done it a few times since, so it’s not a big deal anymore, but that first time wasn’t trivial. A better engineer than me would make a CloudFormation template for this.
  • It expects that I have a subdomain I can point at the load balancer (since the subdomain is what the SSL cert is applied to). I do, but that’s not a given for every organization.
  • It doesn’t do user auth, relying on the doubleplusungood assumption that anyone on my office wifi is authorized to see the docs. I hope to add an OAuth portal at some point. Until then, I just have to be mindful of what I put in the docs.

Thanks for sharing this. Very helpful to see how you serve the site and then loop to regenerate the files; nice.

The service_account.json is used for accessing BigQuery, and I’m wondering if there is an alternative to baking the secret into the container. I think one of these GCP options would let you apply service account permissions to the container without requiring the secret to be present as a file: https://cloud.google.com/container-options/. Also, since the DW is in GCP, why not host the docs there?
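
Even without moving to GCP, one way to avoid baking the secret in would be to drop the COPY service_account.json line from the Dockerfile and bind-mount the key at runtime instead; a sketch, with the host path hypothetical and the container path matching the Dockerfile in the original post:

docker run -d -p 8080:8080 --restart unless-stopped \
  -v /secure/host/path/service_account.json:/home/someUser/service_account.json:ro \
  dbtdocs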

I’m also thinking about authentication. You’re already using a load balancer, and it is possible to set up the load balancer to do auth with Cognito, e.g. https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-authenticate-users.html#cognito-requirements
Cognito supports a bunch of different SSO integrations, so it would be worth looking at.
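
For reference, the authenticate-cognito action is configured on the HTTPS listener itself (a Cognito user pool, app client, and domain have to exist first); a rough sketch in which every ARN, ID, and domain is a placeholder:

aws elbv2 modify-listener \
  --listener-arn <https-listener-arn> \
  --default-actions file://actions.json

# actions.json (placeholders throughout)
[
  {"Type": "authenticate-cognito", "Order": 1,
   "AuthenticateCognitoConfig": {
     "UserPoolArn": "<user-pool-arn>",
     "UserPoolClientId": "<app-client-id>",
     "UserPoolDomain": "<cognito-domain-prefix>"}},
  {"Type": "forward", "Order": 2, "TargetGroupArn": "<target-group-arn>"}
]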

Just wanted to chip in here and describe my train of thoughts and solution.

In the docs section, you can see 4 possibilities, presumably ranked by the authors’ preference.
My personal ranking is:

  1. dbt Cloud or Netlify: a fully managed CI/CD option, but you have to pay at some point
  2. “spin up a webserver like nginx”: in other words, deploy the static website the way you would deploy any other static website (think sites generated by frameworks like Jekyll or Hugo)
  3. S3 bucket

In a case where pricing is an issue and engineering resources/knowledge are available, I would even rank 2 above 1.

If you want to go with option 2, you need three things:

  • a docker image that can build and serve the static website
  • a CI that can build the docker image on any new commit and push it to a repository
  • an “environment” in which to deploy your docker image: whether that’s a server onto which you pull the image, or Kubernetes if you have a cluster set up

Let’s assume you have the last 2 steps ready from your other production services, and let’s focus on the first step: build and serve the static website.

The best way I’ve found to do this is to leverage Docker multistage builds. In other words, you first build and then serve. The image that builds your website is a Python-based image with dbt installed, which runs dbt docs generate. The image that serves your website is a web-server-based image (in my case nginx) that copies over the static files generated in the previous stage and serves them.

For the Dockerfile itself, there are a few specifics to each project, so I can’t share a turnkey Dockerfile, but here is the structure:

# build site
FROM python:3.7-slim as dbt_site_builder

# here add the environment variables you need and will pass via --build-arg 
ARG ...  

# install dbt
ENV DBT_VERSION=0.14.2
RUN apt-get update -y && \
    apt-get install --no-install-recommends -y -q \
    git libpq-dev python-dev && \
    pip install dbt==${DBT_VERSION}

# Set environment variables and working directory
ENV DBT_DIR /source
WORKDIR $DBT_DIR
ENV DBT_PROFILES_DIR $DBT_DIR
COPY . .

# pull dbt dependencies
RUN dbt deps

# build static pages
RUN dbt docs generate

# serve site
FROM nginx:stable-alpine

# here copy any nginx related files you might need for your deployment, for example nginx.conf
ADD ...

COPY --from=dbt_site_builder /source/target/index.html /source/target/manifest.json /source/target/catalog.json /source/target/run_results.json /usr/share/nginx/html/

EXPOSE 80

You can then :

docker build -t dbt-docs --build-arg MY_ENV_VARIABLE .
docker run -p 8080:80 dbt-docs
open http://localhost:8080

Hope this helps

Love it! Proper CD is far superior to the hack-job I did in the original post!