BigQuery tables or views to expose data?

Hey everyone :wave:

I am wondering if you use tables or views to expose data in BigQuery, and if so why?

1) Context

We use dbt + BigQuery in a data mesh setup, where our data products expose one or multiple public datasets.

Our current guideline is to expose views, so our data products are often structured like this:

  • Landing layer (raw incoming data)
  • Private layer (transformed / refined / joined data)
    • work dataset with views for normalization
    • core dataset with tables (often incremental models) where data resides
  • Public layer (exposed data for consumers)
    • Views on top of the core tables
    • Often multiple public datasets for core data products:
      • public dataset with all brands → SELECT * FROM core.table
      • brand-specific datasets (public_brand1, public_brand2, …) → SELECT * FROM core.table WHERE brand = '...'
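For concreteness, a brand-specific public view in this setup boils down to a thin dbt model like the following (a sketch: the `core_customers` model and `brand1` value are made-up names):

```sql
-- models/public_brand1/customers.sql (illustrative path)
-- Thin public view over the private core table, filtered to one brand.
{{ config(materialized='view', schema='public_brand1') }}

select *
from {{ ref('core_customers') }}
where brand = 'brand1'
```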

But we are wondering whether we should encourage tables instead: the core table (incremental model) would then move to the public layer.

2) Advantages of tables

  • Preview works in the BQ UI (not possible with views), letting consumers see what the data looks like.
  • Partition / clustering info is available in the BQ UI (for views, you have to look at the underlying table, which consumers can’t access since the core layer is private).
  • The BQ UI automatically adds a “WHERE partition filter” snippet in the query editor (helps analysts avoid accidental full scans).
  • Policy tags are visible on tables but not on views (we use them for access control on privacy-sensitive columns).
  • Can use time travel, snapshots, and cloning (useful for debugging downstream data product incidents, to see the state of the upstream data product when it last ran).
  • Can use table sampling (for dev env).
  • No need for duplicate dbt models/YAML (core incremental + public view).
    • Maintaining the .yaml files is especially annoying: they are very similar (same column types and descriptions), but we don’t want to define data tests in both the core and public YAML, to avoid paying for them twice. So we can’t simply copy-paste.
  • No need to deal with authorized-view complexity (a special GitHub Action applies permissions and authorizes the view, but the authorization is lost if we re-create the core table, so we need to remember to re-run the action).
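If we switched to tables, the public incremental model would carry its partition/clustering config itself, so consumers would see it directly in the BQ UI. A sketch of what such a model could look like (column and model names are made up, not our actual config):

```sql
-- models/public/customers.sql (illustrative)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['brand'],
    on_schema_change='append_new_columns'
) }}

select *
from {{ ref('work_customers') }}
{% if is_incremental() %}
-- only process rows newer than what's already in the table
where event_date >= (select max(event_date) from {{ this }})
{% endif %}
```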

3) Advantages of views

  • Lower storage + ETL cost: materializing one table per brand dataset duplicates the data (it would live in the “public” table and again in each “public_brandX” table). With views, the data is stored only once, in the core table.
    • But the trade-off isn’t clear-cut: with views, if we don’t cluster on the brand column, a small brand pays to scan the data of all brands. So views reduce storage / ETL cost in our data product but increase query costs for our consumers.
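If we stick with views, one mitigation for the consumer-cost issue is to cluster the core table on the brand column, so that a `WHERE brand = '...'` view only scans that brand’s blocks. A minimal sketch (model names are illustrative):

```sql
-- models/core/customers.sql (illustrative)
-- Clustering on brand lets BigQuery prune blocks when the public
-- brand views filter with WHERE brand = '...'.
{{ config(
    materialized='incremental',
    cluster_by=['brand']
) }}

select * from {{ ref('work_customers') }}
```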

:point_right: What’s your experience? Do you default to tables or views for the public layer of your data products?

:point_right: Do you see other advantages / disadvantages of tables vs views?
:point_right: How do you handle the trade-off when you need to publish the same entity into multiple brand datasets?

Thanks!

I found a disadvantage of using tables for the public layer:

If our public tables are materialized as incremental models with the merge strategy and on_schema_change, dbt creates xxx__dbt_tmp tables for the merge, which expire after 12 hours (see issue).
The problem is that these temp tables not only clutter the dataset visually, they also don’t carry the policy tags of the standard tables. And we use policy tags to restrict access to some privacy-sensitive columns…
So some consumers could read columns they shouldn’t have access to, for the recent rows added during the last incremental run…
I need to see if we can override the materialization to create those temp tables in another dataset.
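One direction I’m exploring (an untested sketch — the macro name and dispatch behavior should be verified against your dbt-bigquery version): override the temp-relation macro so the `__dbt_tmp` tables land in a private dataset instead of the public one. The `dbt_tmp` dataset name below is hypothetical:

```sql
-- macros/make_temp_relation.sql (hypothetical override)
{% macro bigquery__make_temp_relation(base_relation, suffix='__dbt_tmp') %}
    {# Redirect __dbt_tmp merge tables to a private dataset so consumers
       never see untagged temporary copies in the public dataset. #}
    {% set tmp_identifier = base_relation.identifier ~ suffix %}
    {% do return(base_relation.incorporate(
        path={'schema': 'dbt_tmp', 'identifier': tmp_identifier}
    )) %}
{% endmacro %}
```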