For some time we've run our dbt project using Databricks workflows, but we finally decided to take the plunge and build our own dbt-runner on Azure. The Databricks dbt support - while extremely useful as a starting point - offers little in the way of customizing your dbt jobs. There's also the issue of cost, as our workload would require us to stand up a Spark cluster for dbt in Databricks 24/7, all year round, since we need to run a subset of our dbt models near-real-time.
We have implemented dbt as Python-based containerized Azure Functions hosted on dedicated Linux App Service plans. We package all dependencies, including the dbt project itself, inside the container image. The dbt-runner exposes a REST API for running dbt commands and uses an asynchronous request-reply pattern.
We use the Durable Functions extension to Azure Functions, as it is well suited to implementing the asynchronous request-reply pattern and also supports singleton workloads, among a bunch of other useful features.
Once you have the containerization and Function App provisioned, it's pretty easy to get some basic durable functions up and running, especially if you study the samples in the Microsoft documentation.
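To give a feel for the shape of it, here is a minimal sketch based on those Microsoft quickstart samples (Python v2 programming model). The function names, route, and the run_dbt activity are illustrative, not our production code:

import azure.functions as func
import azure.durable_functions as df

app = df.DFApp(http_auth_level=func.AuthLevel.FUNCTION)

# HTTP starter: kicks off an orchestration and immediately returns 202
# with status-polling URLs (the asynchronous request-reply pattern).
@app.route(route="orchestrators/{functionName}")
@app.durable_client_input(client_name="client")
async def http_start(req: func.HttpRequest, client):
    function_name = req.route_params.get("functionName")
    # For singleton behavior you can pass a fixed instance_id to start_new
    # and refuse to start while client.get_status() reports a running instance.
    instance_id = await client.start_new(function_name)
    return client.create_check_status_response(req, instance_id)

# Orchestrator: coordinates the actual work in activity functions.
@app.orchestration_trigger(context_name="context")
def dbt_orchestrator(context):
    result = yield context.call_activity("run_dbt", "build")
    return result

# Activity: this is where the dbt invocation itself lives (see below).
@app.activity_trigger(input_name="command")
def run_dbt(command: str):
    return f"ran: dbt {command}"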
We use programmatic invocation of dbt as described in the dbt documentation.
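The core of that API (available since dbt-core 1.5) looks roughly like this; the tag selector is just a hypothetical example:

from dbt.cli.main import dbtRunner, dbtRunnerResult

# Create a runner; it can be reused across invocations.
dbt = dbtRunner()

# Equivalent to `dbt run --select tag:near_real_time` on the CLI.
res: dbtRunnerResult = dbt.invoke(["run", "--select", "tag:near_real_time"])

# Inspect per-node results on success.
if res.success:
    for r in res.result:
        print(f"{r.node.name}: {r.status}")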
Architecture
We have provisioned two Function Apps in two different regions for redundancy. We use Azure Traffic Manager to direct traffic to the primary endpoint, and only when it becomes unresponsive do we direct traffic to the paired region.
We have configured each Function App to use a zone-redundant (ZRS) storage account within its region. In the primary region we run two B2-tier worker instances on the dedicated App Service plan, while we run only a single worker instance in the paired region.
Deployment
We use an Azure DevOps pipeline with the Docker@2 build task to build the image and publish it to an Azure Container Registry (ACR). We base our image on Microsoft's standard base image for Python-based Azure Functions.
We also generate a Databricks token as part of the build pipeline (using the Databricks CLI) and write it as a secret to a Key Vault. We read the secret from within the function and set an environment variable with the token value at runtime, along with the other connection-related details such as the hostname and HTTP path of the Databricks SQL warehouse.
This is made possible by referencing connection details with Jinja env_var expressions rather than hardcoding values in the profiles.yml file packaged in the Docker container.
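As a sketch of that token-loading step, using the azure-identity and azure-keyvault-secrets packages from our requirements.txt; the vault URL and secret name here are hypothetical:

import os
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Hypothetical vault URL and secret name; substitute your own.
VAULT_URL = "https://my-vault.vault.azure.net"
SECRET_NAME = "databricks-token"

def load_databricks_token() -> None:
    """Read the Databricks token from Key Vault and expose it to dbt."""
    credential = DefaultAzureCredential()  # e.g. the Function App's managed identity
    client = SecretClient(vault_url=VAULT_URL, credential=credential)
    secret = client.get_secret(SECRET_NAME)
    # profiles.yml picks this up via {{ env_var('DBT_DATABRICKS_TOKEN') }}.
    os.environ["DBT_DATABRICKS_TOKEN"] = secret.value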
Project Structure
profiles.yml
data_platform:
  target: default
  outputs:
    default:
      catalog: null
      host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
      http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
      schema: "{{ env_var('DBT_DATABRICKS_SCHEMA') }}"
      threads: 6
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
      type: databricks
Dockerfile
FROM mcr.microsoft.com/azure-functions/python:4-python3.10
# Set environment variables for databricks connection
ARG hostName
ARG httpPath
RUN echo DBT_DATABRICKS_HOST: ${hostName} && \
    echo DBT_DATABRICKS_HTTP_PATH: ${httpPath}
# Set general environment variables
ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    DBT_DATABRICKS_HOST=${hostName} \
    DBT_DATABRICKS_HTTP_PATH=${httpPath} \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true
COPY . /home/site/wwwroot
COPY profiles.yml /home/site/wwwroot/dbt
WORKDIR /home/site/wwwroot
# 1. Install dependencies from requirements.txt file
# 2. Create empty dbt_packages folder in root folder
RUN pip install --no-cache-dir -r requirements.txt && \
    mkdir dbt_packages
WORKDIR /home/site/wwwroot/dbt
RUN dbt deps
# Set environment variables for dbt folders
ENV DBT_PROJECT_DIR=dbt \
    DBT_PROFILES_DIR=dbt
# Set working dir to the root folder.
WORKDIR /
requirements.txt
# Do not include azure-functions-worker in this file
# The Python Worker is managed by the Azure Functions platform
# Manually managing azure-functions-worker may cause unexpected issues
azure-functions
azure-functions-durable
azure-identity
azure-keyvault-secrets
dbt-databricks==1.7.7
I wrote a walkthrough in a LinkedIn article (hope it's ok to post a link here), which also includes code samples for the Azure Functions.
Deploy dbt-core Workloads on Azure Using Durable Functions | LinkedIn