dbt + Airflow on Google Cloud Composer

I am testing a deployment of dbt within Cloud Composer. On my local machine (Ubuntu 20.04) I have successfully run the dbt models with Airflow. When running on Google Cloud Composer I get the following error:

{subprocess.py:74} INFO - Output:
{subprocess.py:78} INFO - Running with dbt=0.21.0
{subprocess.py:78} INFO - Encountered an error while reading the project:
{subprocess.py:78} INFO -   ERROR: Runtime Error
{subprocess.py:78} INFO -   at path ['name']: 'jaffle_shop' does not match '^[^\\d\\W]\\w*$'
{subprocess.py:78} INFO - 
{subprocess.py:78} INFO - Error encountered in /home/airflow/gcs/dags/dbt_project.yml
{subprocess.py:78} INFO - Encountered an error:
{subprocess.py:78} INFO - Runtime Error
{subprocess.py:78} INFO -   Could not run dbt
{subprocess.py:82} INFO - Command exited with return code 2
{taskinstance.py:1503} ERROR - Task failed with exception

We are using a BashOperator to run dbt models in Airflow (a simplified sketch of the task is below).
Initially we had some problems with dependencies, but they were solved.
We are using a standard dbt_project.yml file with a single model, just to test how this works.
Another option is to use Docker, but we first need to check whether this approach works.
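A simplified sketch of that BashOperator task (the paths and flags here are assumptions based on the error log above, not the exact code):

from airflow.operators.bash import BashOperator

# Simplified sketch of the dbt task; the project path is assumed from the
# error log above. The dbt project (dbt_project.yml, profiles.yml) is synced
# to the Composer bucket and mounted under /home/airflow/gcs/dags.
dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command=(
        "dbt run "
        "--project-dir /home/airflow/gcs/dags "
        "--profiles-dir /home/airflow/gcs/dags"
    ),
)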


Versions

dbt: 0.21.0

cloud-composer: 1.17.1

airflow: 2.1.2

PyPI Packages
airflow-dbt: 0.4.0

dbt: 0.21.0

jsonschema: 3.1 (added this as PyPI was throwing an error about the version)

I really appreciate any help.

Pete.

Hi there,
I'm not sure what the reason is for your error, but this is my experience. Hope it helps in some way.
When we used the BashOperator we ran into dependency issues between Composer, Airflow and dbt. We ended up copying our dbt projects into a folder in the Airflow GCS bucket.
We had one folder for many dbt projects. The BashOperator ran a dbt_run.sh, passing the needed parameters:

  • target
  • models
  • vars
  • full_refresh
  • repo_name (dbt project name)

The .sh uses a Makefile to build the dbt run commands and creates a virtual environment to bypass the dependency issues. DISCLAIMER: I wouldn't suggest running a virtual environment in Composer, since it's not approved by Google. We only found this out much later and are in the process of changing it. Since you've solved your dependency issues, this might still be useful.
Here's the basic Python:

call_dbt_run = BashOperator(
    task_id='call_dbt_run',
    bash_command=f"/home/airflow/gcs/data/dbt-projects/dbt_run.sh {dbt_target} {dbt_model} {dbt_vars} {full_refresh} {repo_name}",
    on_success_callback=log_dbt_success,
    on_failure_callback=log_dbt_failure,
    dag=dag,
)

Here is the .sh:

#!/bin/bash
dbt_target=$1
dbt_models=$2
dbt_vars=$3
full_refresh=$4
repo_name=$5
suffix=$(date +%s$RANDOM)

echo "Start creating virtualenv: dbt_env$suffix"
virtualenv /tmp/dbt_env$suffix --python=python3.6.10
echo "Finish creating virtualenv: dbt_env$suffix"

cd /home/airflow/gcs/data/dbt-projects/

# copy the project repo code to a local temp folder
mkdir -p /tmp/$suffix
cp -r $repo_name /tmp/$suffix
ls /tmp/$suffix/$repo_name

echo "Start activating virtualenv and installing packages: dbt_env$suffix"
source /tmp/dbt_env$suffix/bin/activate && /tmp/dbt_env$suffix/bin/pip install -r ./requirements.txt
echo "Finish activating virtualenv and installing packages: dbt_env$suffix"

if [ "$full_refresh" = "True" ]; then
  make run-full-refresh DBT_TARGET="$dbt_target" DBT_MODELS="$dbt_models" DBT_VARS="$dbt_vars" REPO_NAME="$repo_name" SUFFIX="$suffix"
else
  make run DBT_TARGET="$dbt_target" DBT_MODELS="$dbt_models" DBT_VARS="$dbt_vars" REPO_NAME="$repo_name" SUFFIX="$suffix"
fi

# capture the exit code from the dbt run command
# so that the final exit code from removing the virtualenv doesn't get used by BashOperator
exit_code=$?

echo "Start removing virtualenv: dbt_env$suffix"
rm -rf /tmp/dbt_env$suffix
echo "Finish removing virtualenv: dbt_env$suffix"

if [ $exit_code -ne 0 ]; then
    echo "exit code $exit_code"
fi

# remove local temp folder
echo "Start removing local dbt folder /tmp/$suffix"
rm -rf /tmp/$suffix
echo "Finish removing local dbt folder /tmp/$suffix"

# rethrowing the exit code to BashOperator
exit $exit_code

And here’s the Makefile:

.ONESHELL: 
# special target: .ONESHELL instructs make to invoke a single instance 
# of the shell and provide it with the entire recipe, 
# regardless of how many lines it contains.

# the /home/airflow/gcs/data/dbt-projects/$repo_name will have the models,
# and appropriate dbt files for a particular dbt job object.
# cd $repo_name

debug:
	cd /tmp/$(SUFFIX)/$(REPO_NAME)
	dbt debug --profiles-dir=./profiles_dir

debug-local:
	dbt debug --profiles-dir=$(local_path)/profiles_dir

run:
	cd /tmp/$(SUFFIX)/$(REPO_NAME)
	dbt run --target=$(DBT_TARGET) --models=$(DBT_MODELS) --vars '$(DBT_VARS)' --profiles-dir=./profiles_dir

run-local:
	dbt run --profiles-dir=$(local_path)/profiles_dir

run-full-refresh:
	cd /tmp/$(SUFFIX)/$(REPO_NAME)
	dbt run --target=$(DBT_TARGET) --models=$(DBT_MODELS) --vars '$(DBT_VARS)' --profiles-dir=./profiles_dir --full-refresh

test:
	echo "$(DBT_TARGET)"

We also use the KubernetesPodOperator in our Composer environment (just check whether Workload Identity will work in your environment).
Here's the Discourse article with the explanation: Running dbt in Composer using a KubernetesPodOperator.
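For a rough idea, a KubernetesPodOperator task along those lines looks something like this (a minimal sketch; the image name, namespace and dbt arguments are placeholders, not our exact setup):

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Minimal sketch: run dbt inside a container that already has dbt and the
# project baked in. Image name, namespace and arguments are placeholders.
dbt_run_kpo = KubernetesPodOperator(
    task_id='dbt_run_kpo',
    name='dbt-run',
    namespace='default',
    image='gcr.io/your-project/dbt-project:latest',  # hypothetical image
    cmds=['dbt'],
    arguments=['run', '--target', 'prod', '--profiles-dir', '/dbt/profiles_dir'],
    get_logs=True,
    is_delete_operator_pod=True,
    dag=dag,
)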

Hi meurant,

Thanks for the detailed response. I guess the approach is to isolate the environment that runs dbt and avoid any conflicts between Composer and dbt, whether by using a virtual environment or Docker with the KubernetesPodOperator.

As you mentioned that the virtual environment is not a recommended approach, I suppose the KubernetesPodOperator is the way forward. I did read the article on the Kubernetes Pod Operator, and it explains well how to run dbt with Docker/KPO. I will try this approach and check how it works.

Regards
Pete

Hi there,
I forgot to add that we are looking into using Cloud Run to execute dbt jobs.
I’ll add more info when I can.
Cheers,
Meurant


Hi! Thanks for all your time and your work. I read your article about Running dbt in Composer using a KubernetesPodOperator and it will help me a lot. So I was wondering which part of your previous article Cloud Run would replace. (I also have another question about managing your Docker images in CI/CD, but I don't want to take up your time.)

Thank you for your time and have good day,

Best regards,
Emmanuel.

Hi,
At the moment we are just in the initial phase and testing the waters. However, dbt was extracted out of Composer and put into GKE as a REST endpoint, and it is working so far. We are starting to put heavy loads onto it and to look at the metrics that matter most to us. We use Jenkins at the moment, but that could change soon, so I cannot comment much on that.
Thanks.
Pete

Can you expand on this part and how you figured it out?
I definitely believe it - GCP seems picky about what gets installed on Composer instances and how you do it, but I don’t see much in the docs warning about not using virtualenvs.

Our Cloud Services (security, infra, etc.) team ran checks and our work popped up in their list.
I'm not sure what the checks were.

There’s a bit that changes. In short:

  • A flask app was built that executes dbt commands.
  • The app is called from Cloud Run.
  • Cloud Run is triggered by composer.

Unfortunately I don't have more details, but I'll create a Discourse article when I do.
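In the meantime, the shape of the Flask app is roughly this (a bare-bones sketch, not our production code; the endpoint name and payload fields are made up for illustration):

import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

# Bare-bones sketch of a service that executes dbt commands on request.
# The route name and payload fields are illustrative only.
@app.route('/run', methods=['POST'])
def run_dbt():
    payload = request.get_json(force=True)
    cmd = ['dbt', 'run', '--target', payload.get('target', 'dev'),
           '--profiles-dir', './profiles_dir']
    if payload.get('models'):
        cmd += ['--models', payload['models']]
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = 200 if result.returncode == 0 else 500
    return jsonify({'returncode': result.returncode,
                    'stdout': result.stdout,
                    'stderr': result.stderr}), status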

Run the task with the DbtRunOperator on GCP Composer.
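If I understand the airflow-dbt package (listed in the PyPI packages above) correctly, that would look something like this (a sketch; the project directory and target are assumptions):

from airflow_dbt.operators.dbt_operator import DbtRunOperator

# Sketch using the airflow-dbt package; the paths and target are assumptions,
# adjust to where your project lives in the Composer bucket.
dbt_run = DbtRunOperator(
    task_id='dbt_run',
    dir='/home/airflow/gcs/dags',           # folder containing dbt_project.yml
    profiles_dir='/home/airflow/gcs/dags',  # folder containing profiles.yml
    target='dev',
    dag=dag,
)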

@minhajpasha Same issue here, the DbtRunOperator was no help.

It's not an issue with Composer or dbt. I can see that dbt starts executing, but due to improper configuration parameters in the files it's unable to parse profiles.yml.

If required, let's connect over a call so that I can guide you. Can you please share your profiles.yml & dbt_project.yml files?

@minhajpasha, very much appreciate the offer of help & a call! But I did manage to track down the ultimate issue.

My Stack Overflow response is over here, but for anyone else coming across this:

After much digging, the source of the problem here is the jsonschema dependency. Version 3.1.0 does not work, while versions 3.1.1 and 3.2.0 will work, and should also work within Composer's dependency requirements.

It looks like there was an issue with the jsonschema folks switching to js-regex in 3.1.0, which caused them to revert back to the regular re module in 3.1.1.
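A quick sanity check shows the project name is perfectly valid under Python's standard re module, which is why the error only appears with the js-regex-backed 3.1.0:

import re

# The pattern from the error message: a dbt project name must start with a
# letter or underscore and contain only word characters.
pattern = r'^[^\d\W]\w*$'

# 'jaffle_shop' matches under the standard re module, so the 3.1.0 failure
# comes from the js-regex backend, not from the project name itself.
print(bool(re.match(pattern, 'jaffle_shop')))  # True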

There are some details here, and a couple of related issues described there as well (sorry, I can only post 2 links as a new user).

In general, it would be much nicer if Cloud Composer supported virtual environments to avoid this entire dependency-collision mess, but apparently Google does not support that approach.