We populate our raw database using Fivetran. Since dbt automatically hardcodes the columns in a view definition, our staging views break every time a new column is added to the underlying table, with an error like:
View definition for 'ANALYTICS_DB.STG_SCHEMA.STG_VIEW' declared 16 column(s), but view query produces 19 column(s).
The context of why I’m trying to do this
I’m trying to find a way to avoid re-running all of my staging views every time Fivetran finishes syncing, just to prevent this breakage.
What I’ve already tried
I’ve been looking at dbt-coves and dbt-osmosis. I think I could get things working with a decent amount of modification and combination of the two, but (a) it seems like far more work than should be required for something like this, and (b) the performance is terrible.
Some example code or error messages
Model ANALYTICS_DB.STG_SCHEMA.STG_VIEW is defined as:
SELECT
*
FROM
FIVETRAN_DB.SCHEMA.TABLE
But gets compiled to:
CREATE OR REPLACE VIEW ANALYTICS_DB.STG_SCHEMA.STG_VIEW AS
(
SELECT
COL1
, COL2
, COL3
FROM
FIVETRAN_DB.SCHEMA.TABLE
);
What adapter are you using? I ask because when I look in target/run for my test (DuckDB) project, I don’t see it enumerate the columns the way your example shows.
I had a play with Snowflake, and this is not actually dbt doing it; it’s how Snowflake stores the resulting created view. I did the same thing as above, and the code in /target/run includes the select *. I then pointed the model at a test table, added a column to that table, and got the error you’re seeing.
What dbt ran:
create or replace view db.schema.test
as (
select
*
from db.scratch.jd_table
);
What Snowflake stores as the definition of the view:
create or replace view DB.SCHEMA.TEST(
FOO,
BAR
) as (
select
*
from db.scratch.jd_table
);
So when the underlying table adds a column, the view fails.
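If you want to verify this yourself, Snowflake’s get_ddl function returns the view definition exactly as stored, including the enumerated column list (shown here against the example view above):
-- returns the definition as Snowflake stores it, columns and all
select get_ddl('view', 'db.schema.test');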
You can automate adding the columns by using dbt_utils.star(), which means your views won’t fail when columns are added, and they will pick up new columns on your next dbt run without you having to touch the code.
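For example, a minimal sketch of a staging model using it (the source and table names are placeholders for your own source definitions):
-- star() resolves at compile time to an explicit list of the table's current columns
select
    {{ dbt_utils.star(from=source('fivetran', 'my_table')) }}
from
    {{ source('fivetran', 'my_table') }}
Because star() expands to an explicit column list every time the model compiles, each dbt run recreates the view against the table’s current shape.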
All of that said, my subjective opinion is that staging models are for renaming, casting types, and so on: essentially creating an endorsed view of your source data. With that framing, I prefer to enumerate the columns manually (using codegen to generate the initial list; see the example below). Otherwise, you may as well not have staging models at all and just select directly from the source.
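If you haven’t used it, dbt-codegen’s generate_base_model operation writes out the boilerplate select with every current column enumerated, which you can paste into the model file and edit from there (the source and table names are placeholders):
dbt run-operation generate_base_model --args '{"source_name": "fivetran", "table_name": "my_table"}'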
That was awesome of you to dive in there. I should’ve looked at the compiled code myself, instead of just assuming that it was dbt doing this rather than Snowflake.