The problem I’m having
I’m looking for advice on the best way to manage sources / staging models that live in multiple BigQuery regions, while the primary analytics models are built in us-west1.
The context of why I’m trying to do this
I am currently running dbt Core in GitHub Actions to do daily/hourly builds in BigQuery.
Most raw data sources I work with are loaded into us-west1, and the core data models are all built in a dataset in us-west1 for BI & rETL. However, we are starting to get more sources in other BigQuery regions, mostly because of:
- Use of external tables & BigQuery Omni, which may connect to blob storage in other clouds/regions
- Access to BigQuery datasets that sit in a different region in another GCP project
So essentially we have one main dbt project for one analytical BigQuery instance in a single analytics-specific GCP project, but we are getting sources from other regions and from BigQuery datasets in other projects. I’m hoping to manage all of this in one place, ideally through a single dbt project where we can see all the data movement and lineage.
What I’ve already tried
Today I have targets in profiles.yml that look like the below.
prod:
  dataset: analytics
  location: us-west1
dev:
  dataset: dbt_dev
  location: us-west1
azure-eastus2:
  dataset: analytics_staging_us
  location: azure-eastus2
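For completeness, here is roughly how those targets would sit inside a full profiles.yml (a minimal sketch, not my actual file — the profile name, GCP project id, auth method, and thread count are assumptions; the datasets and locations are the real ones):

my_dbt_project:                          # hypothetical profile name
  target: dev
  outputs:
    prod:
      type: bigquery
      method: oauth                      # assumed auth method
      project: analytics-gcp-project     # hypothetical GCP project id
      dataset: analytics
      location: us-west1
      threads: 4
    dev:
      type: bigquery
      method: oauth
      project: analytics-gcp-project
      dataset: dbt_dev
      location: us-west1
      threads: 4
    azure-eastus2:
      type: bigquery
      method: oauth
      project: analytics-gcp-project
      dataset: analytics_staging_us
      location: azure-eastus2
      threads: 4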
For scheduled prod builds, the GitHub Action basically does the steps below (a sketch of the full workflow job is included after the list):
- Run the dbt-external-tables package to pick up any changes; this covers all external tables, both from standard GCS blob storage and from BigQuery Omni connections:
dbt run-operation stage_external_sources --target azure-eastus2
An example of the external table source spec:

version: 2
sources:
  - name: azure_eastus2
    schema: azure_eastus2
    tables:
      - name: logs
        external:
          location: azure://..../*
          options:
            format: CSV
            skip_leading_rows: 1
            connection_name: <omni_connection_name>
- Build models from each region into a single analytics_staging_us dataset (multi-region US), using the dbt target configured for that specific region. The models here basically do a few things incrementally: 1) read from the external table, 2) do some basic aggregation & parsing, 3) load the result into a physical table in the US region.
dbt build --target azure-eastus2 --select tag:azure-eastus2
The entire analytics_staging_us dataset is set up with secondary replication to us-west1 in order to be accessible by the core models, so we want to keep what lands there as aggregated as possible.
- Run the main analytical build in us-west1, which reads from the physical tables in analytics_staging_us as raw sources:
dbt build --target prod --select +tag:daily
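Putting those three steps together, the scheduled workflow job looks roughly like this (a minimal sketch, not the actual workflow — the file name, trigger, Python/dbt setup, and auth handling are assumptions; only the three dbt commands are the real ones):

name: dbt prod build                     # .github/workflows/dbt_prod_build.yml (hypothetical)
on:
  schedule:
    - cron: "0 * * * *"                  # hourly; actual schedule assumed

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-bigquery
      - run: dbt deps
      # BigQuery auth / profiles.yml provisioning omitted here
      # 1. refresh external tables (GCS + Omni) in their own region
      - run: dbt run-operation stage_external_sources --target azure-eastus2
      # 2. aggregate each region's external tables into analytics_staging_us
      - run: dbt build --target azure-eastus2 --select tag:azure-eastus2
      # 3. main analytical build in us-west1, reading the replicated staging tables
      - run: dbt build --target prod --select +tag:daily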
So for Omni the dataflow goes:
Azure:eastus2 (or any other US region) —[incremental materialization from ext. tbl]—> GCP:US —[dataset replication]—> GCP:us-west1
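To make the first arrow concrete, one of the incremental staging models reading from the external table looks roughly like this (a simplified sketch — the model name, column names, and aggregation are invented for illustration; only the incremental materialization, the region tag, and the source reference reflect the real pattern):

-- models/staging/azure_eastus2/stg_azure_eastus2__logs.sql (hypothetical name)
{{
  config(
    materialized='incremental',
    tags=['azure-eastus2'],
    partition_by={'field': 'log_date', 'data_type': 'date'}
  )
}}

-- read straight from the Omni external table and keep only a light aggregation,
-- so the table that gets replicated to us-west1 stays as small as possible
select
    date(event_timestamp) as log_date,
    event_type,
    count(*)              as event_count
from {{ source('azure_eastus2', 'logs') }}
{% if is_incremental() %}
  -- only scan new data on incremental runs
  where date(event_timestamp) > (select max(log_date) from {{ this }})
{% endif %}
group by 1, 2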
The problem with this approach:
- The dbt graphs are disconnected at the replication step: the tables materialized from each region into analytics_staging_us now appear as raw dbt sources in the main analytical models (instead of refs); an example of what this looks like is shown after this list.
- dbt docs generate will now error without --empty-catalog, because it can only use a single target when compiling and scanning the catalog, and we have some sources that are not reachable from any single target.
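For example, on the prod side the replicated table ends up being declared as a plain source rather than referenced with ref (illustrative names only):

# sources declared for the core models in us-west1 (hypothetical file)
version: 2
sources:
  - name: analytics_staging_us
    schema: analytics_staging_us
    tables:
      - name: stg_azure_eastus2__logs    # hypothetical replicated staging table

Downstream models then use {{ source('analytics_staging_us', 'stg_azure_eastus2__logs') }} instead of {{ ref('stg_azure_eastus2__logs') }}, so dbt has no idea the two sides of the replication are related.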
I think the fundamental limitation is that I can’t tell dbt to use multiple targets within a single invocation, or configure multiple session locations on one target, but I’m not sure if I’m missing anything in dbt that can handle this scenario more gracefully.