Everything
Usage statistics
As you won’t have access to the internet anyway, it is best to turn off anonymous usage statistics. dbt takes a few seconds to work out that it has no internet access before giving up on sending them, so disabling them saves that time too. In profiles.yml:
config:
  send_anonymous_usage_stats: False
Containers
Many of the solutions below are some variant of “put a container in X”. Containers have all those great containery features, such as being able to deploy the same image anywhere, but for our locked-down focus they also give us a way to pull dependencies: gather all of them when building the container image, so that there is nothing left to do at runtime except call dbt.
Inside the running container you can also read environment variables or key/parameter stores to pick up the remaining configuration you need (dbt can read environment variables inside profiles.yml via its env_var function).
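For example, here is a minimal sketch of a start-up step that pulls the warehouse password from AWS SSM Parameter Store and hands it to dbt. The parameter name and paths are made up, and it assumes boto3 is baked into the image and the task role is allowed to read the parameter:

import os
import subprocess

import boto3

# fetch the warehouse password at container start-up
# (the parameter name is an assumption -- use whatever you stored)
ssm = boto3.client("ssm")
password = ssm.get_parameter(
    Name="/prod/dbt/db_password", WithDecryption=True
)["Parameter"]["Value"]

# expose it as an environment variable; profiles.yml picks it up
# with "{{ env_var('DBT_PASSWORD') }}"
env = dict(os.environ, DBT_PASSWORD=password)
subprocess.run(["dbt", "run"], cwd="/home/local/transform", env=env, check=True)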
Here is part of an example Dockerfile (the base image and user creation are assumptions added to make it self-contained).
# base image is an assumption -- any image with a working pip will do
FROM python:3.7-slim

# create an unprivileged user to run dbt as
RUN useradd --create-home local

# install dbt and only the adapter you need
RUN pip install dbt-core==0.14.2
RUN pip install dbt-postgres==0.14.2
# copy whatever scripts you need to run
COPY ./container_assets/*.sh /home/local/
RUN chmod +x /home/local/*.sh
# copy your dbt profile
RUN mkdir -p /home/local/.dbt
COPY ./container_assets/profiles.yml /home/local/.dbt
# grant user permissions and switch to that user
RUN chown -R local /home/local
USER local
WORKDIR /home/local
# copy your project directory
RUN mkdir -p /home/local/transform
COPY ./transform /home/local/transform
# run dbt deps to get dependencies
RUN cd /home/local/transform && \
    dbt deps
# on startup the pipeline runs
CMD ./dw_pipeline.sh
AWS
Lambda
Lambda is a pretty easy “no”: if your code runs for longer than 15 minutes, it gets killed, and there is no way to work around this.
AWS Glue
Glue seems like a very expensive way to run a long-running script. @martin also mentions space issues, which suggests there may be a limit on how big your scripts (which include dbt itself) can be.
ECR + AWS Batch
Store your job in a container on ECR, run it on ECS using AWS Batch.
Batch is a wrapper around ECS that is meant to make it easier to fire off jobs and to handle things like parallel tasks and job prioritization. That is all overkill when we are only looking to run one job at a time. You will also need to put a Lambda plus a scheduler in front of this to fire off the job however often you want, and you will need to query the API to make sure the last job finished before firing off a new one, which is clunky to do with the Batch API.
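For a sense of that clunkiness, here is a sketch of the “has the last job finished” check against the Batch API, assuming boto3 and a made-up queue name; every non-terminal status has to be polled separately:

import boto3

batch = boto3.client("batch")

# every non-terminal Batch job status has to be checked separately
ACTIVE_STATUSES = ["SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"]

def previous_job_still_active(queue="dbt-job-queue"):  # queue name is an assumption
    for status in ACTIVE_STATUSES:
        jobs = batch.list_jobs(jobQueue=queue, jobStatus=status)
        if jobs["jobSummaryList"]:
            return True
    return False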
So why not just cut out the middle man?
ECR + Lambda
Store your job in a container on ECR, run it on ECS using a scheduled Lambda.
The ECS and AWS Batch APIs are very similar, and you still end up doing the heavy lifting of scheduling yourself anyway, so you might as well call ECS directly and have one less cog in the system that can go wrong. This approach is very easy, can be automated, requires no internet access, and is cheap.
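As a rough sketch, the scheduled Lambda only needs a couple of boto3 calls; the cluster, task definition, and network details below are made-up names, so substitute your own:

import boto3

ecs = boto3.client("ecs")

CLUSTER = "dbt-cluster"       # assumption: your ECS cluster name
TASK_FAMILY = "dbt-pipeline"  # assumption: your task definition family

def handler(event, context):
    # don't start a new run if the previous one is still going
    running = ecs.list_tasks(
        cluster=CLUSTER, family=TASK_FAMILY, desiredStatus="RUNNING"
    )
    if running["taskArns"]:
        return "previous run still in progress, skipping"

    ecs.run_task(
        cluster=CLUSTER,
        taskDefinition=TASK_FAMILY,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-aaaa1111"],     # your private subnets
                "securityGroups": ["sg-bbbb2222"],
                "assignPublicIp": "DISABLED",
            }
        },
    )
    return "run started"

Point a scheduled CloudWatch Events rule at this handler to fire it however often you want.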
ECS Side note - Fargate vs EC2 Cluster
The general rule is: if you can fill an entire EC2 cluster with containers, it is probably cheaper than Fargate, but that comes at the cost of managing the host VMs yourself. For dbt you have one container that runs infrequently and barely needs any resources at all, so Fargate is the natural fit.
Rough Fargate costs: a 6-hour dbt run with 0.25 vCPU and 0.5 GB of RAM (yes, dbt will happily run with that) works out to about $0.08 per run. Small enough to be a rounding error compared to your data warehouse infrastructure costs.
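Back-of-the-envelope, using the us-east-1 on-demand Fargate rates at the time of writing (an assumption, so check current pricing):

# Fargate bills per vCPU-hour and per GB-hour
vcpu_rate = 0.04048   # $ per vCPU per hour
gb_rate = 0.004445    # $ per GB per hour

hours, vcpus, gb = 6, 0.25, 0.5
print(hours * (vcpus * vcpu_rate + gb * gb_rate))   # ~0.074, i.e. about $0.08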
Azure
I am not active on Azure at the moment, though I would love to see someone try to put dbt on Azure Functions using the Premium plan, which removes the execution time limits.