Publishing DBT Docs from a Docker Container


#1

First, allow me to acknowledge that this is overkill for most. I want my DBT docs to be accessible to everyone in my business, but not to everyone else. I could do this with Sinter of course, but 300 employees @ $20 per seat per month is a bit too rich for my blood.

@peter_hanssens wrote a great piece here about how to use Netlify to serve your docs. But password-protecting that content runs you $9 per user per month. Less spendy than the native Sinter option, but still more than I want to spend.

My team already uses Docker for a ton of stuff, and we already have infrastructure in AWS that runs Docker containers, so I decided to go that route.

Dockerfile

FROM python:3.6

ARG user=someUser
ARG organization=yourGitHubOrg
ARG repo=yourDBTRepo

ARG homedir=/home/${user}

# Non-root group & user creation (before any COPY, so useradd -m creates
# ${homedir} owned by the user)
RUN groupadd -r ${user}
RUN useradd -r -m -g ${user} ${user}
RUN mkdir ${homedir}/.ssh

COPY entrypoint.sh ${homedir}

# Git
ENV REMOTE_REPO git@github.com:${organization}/${repo}.git
ENV REPO_DIR ${homedir}/${repo}
RUN apt-get update && apt-get install -y git
COPY id_rsa ${homedir}/.ssh/
RUN ssh-keyscan github.com >> ${homedir}/.ssh/known_hosts

# BigQuery
ENV GOOGLE_APPLICATION_CREDENTIALS ${homedir}/service_account.json
COPY service_account.json ${homedir}

# Permissions! (chown the whole home dir so the non-root user can clone into it)
RUN chmod 0700 ${homedir}/.ssh
RUN chmod 0600 ${homedir}/.ssh/id_rsa
RUN chmod 0644 ${homedir}/.ssh/known_hosts
RUN chmod 0755 ${homedir}/entrypoint.sh
RUN chown -R ${user}:${user} ${homedir}

# DBT!
RUN pip install dbt==0.11.1

# Prep for container execution
USER ${user}
WORKDIR ${homedir}
ENTRYPOINT ["/bin/bash", "entrypoint.sh"]

Note: You may need to tweak this to enable a connection to your data warehouse as specified in your profiles.yml file.

entrypoint.sh

#!/usr/bin/env bash

# Initial clone, dependency install, and docs build
git clone $REMOTE_REPO
cd $REPO_DIR
dbt deps --profiles-dir .
dbt docs generate --target prod --profiles-dir .

# Serve the docs site in the background (port 8080 by default)
dbt docs serve --profiles-dir . > /dev/null 2>&1 &

# Every 10 minutes, regenerate the docs if master has moved. The fetch has
# to happen *before* the comparison; otherwise the local origin/master ref
# never updates and the check never fires.
while true
do
    sleep 600
    git fetch --all
    if [ "$(git rev-parse --short HEAD)" != "$(git rev-parse --short origin/master)" ]; then
        git reset --hard origin/master
        dbt deps --profiles-dir .
        dbt docs generate --target prod --profiles-dir .
    fi
done

Note: I have my profiles.yml file in the root of my repo.
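
For reference, here's a minimal sketch of what that profiles.yml might look like against BigQuery, since the Dockerfile bakes in a service account keyfile. The profile, project, and dataset names below are placeholders; match the profile name to whatever your dbt_project.yml expects.

# profiles.yml (sketch; names are placeholders)
my_dbt_project:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      project: your-gcp-project
      dataset: analytics
      keyfile: /home/someUser/service_account.json
      threads: 4

(Since the Dockerfile also sets GOOGLE_APPLICATION_CREDENTIALS, method: oauth with application default credentials should work too.)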

The entrypoint script will clone your dbt repo, install dependencies, generate the docs, and start the webserver in the background. It then enters a loop: every 10 minutes it checks whether the code on GitHub has changed, and if so pulls down the new code and regenerates the docs files.

Build the image from a location where your authentication files are present, so they can be copied into your image. I have a little EC2 instance for just this purpose, so I don’t have service account credentials proliferating onto a bunch of engineer laptops. From the desired folder, I run docker build --pull -t dbtdocs .
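
Since the Dockerfile's ARGs default to placeholders, you'd override them at build time with something like this (the user, org, and repo values here are made up):

docker build --pull \
    --build-arg user=dbtdocs \
    --build-arg organization=my-org \
    --build-arg repo=analytics-dbt \
    -t dbtdocs .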

I can deploy that image to ECR if I want, but I run it from the image build machine (it’s idle most of the time anyway; might as well give it something else to do!). To start serving the site, I run docker run -d -p 8080:8080 --restart unless-stopped dbtdocs
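
If you did want to push to ECR, the dance would look roughly like this (account ID, region, and repository name are all hypothetical, and this assumes the ECR repository already exists):

ACCOUNT=123456789012
REGION=us-east-1
aws ecr get-login-password --region $REGION | \
    docker login --username AWS --password-stdin $ACCOUNT.dkr.ecr.$REGION.amazonaws.com
docker tag dbtdocs:latest $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/dbtdocs:latest
docker push $ACCOUNT.dkr.ecr.$REGION.amazonaws.com/dbtdocs:latest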

Assuming your network settings allow it, now you can hit port 8080 on that machine and your docs site should be visible.
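
A quick smoke test from another machine on the network (the hostname is a placeholder for your docker host):

curl -I http://your-docker-host:8080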

I went a step further, however, to SSL-ify the connection. I use an EC2 HTTPS load balancer (because it makes it easy to deal with certificates and whatnot) listening on port 443, as expected, which routes all traffic to the docker host’s 8080; the docker host then routes it to the docker container’s 8080, where dbt docs serve is serving the content. My security group settings in AWS disallow anything except 443 from outside (and even then only from office IP addresses), but allow 8080 on the internal network.
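
For the curious, the ingress rules amount to something like this (the security group ID and the CIDR blocks are placeholders for your office IPs and your VPC range):

# HTTPS, from office IPs only
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 443 --cidr 203.0.113.0/24
# docs webserver, internal VPC only
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8080 --cidr 10.0.0.0/16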


#2

Some known weaknesses:

  • I have to maintain the hardware for this.
  • The first time through doing the load balancer and security group config was a bit hairy. I’ve done it a few times since, so it’s no big deal anymore, but that first time wasn’t trivial. A better engineer than me would make a CloudFormation template for this.
  • This expects that I have a subdomain I can point to the load balancer (since the subdomain is what the SSL cert is applied to). I do, but that’s not a given for some organizations.
  • It doesn’t do user auth, relying on the doubleplusungood assumption that anyone on my office wifi is authorized to see the docs. I hope to add an OAuth portal at some point. Until then, I just have to be mindful of what I put in the docs.