Pre-release: v0.16.0 (Barbara Gittings)

A new pre-release of dbt v0.16.0 (Barbara Gittings) is now available on PyPI and in dbt Cloud. The new and changed functionality in this pre-release is covered below. There are heaps of new features, performance improvements, and bugfixes in this pre-release, and we need your help testing them!

View the Changelog for the full set of changes implemented since 0.15.x

Installation:

# with pip
pip install dbt==0.16.0b3

In dbt Cloud:
Select 0.16.0 beta in the environment of your choice.

If you give the pre-release a spin, please follow up below with a comment indicating which features you tried out and whether you had any issues running dbt. If you do encounter an issue with the pre-release, we’ll work to fix it before the final release of dbt v0.16.0 (estimated ~2 weeks from now).

Please note that documentation for the new and changed functionality is coming shortly! I’ll be updating this post with links to the relevant docs as they’re added.

New and changed (highlights):

Core

  • Add support for a generate_database_name macro (a sketch follows the examples below)
  • Rationalize CSV parsing in seed files. Seed behavior should be much less surprising now
    • If a seed column is configured with a custom data type, that column will be parsed as a string to avoid errant agate type inference
    • Limit the types of date/time formats that are automatically parsed from strings. A value like Sunday will no longer be parsed as the first Sunday after 0000-01-01, which is good.
  • Improve the speed of catalog queries (parallelizing smaller queries with push-down filter logic). This should ameliorate the issues Snowflake and BigQuery users were seeing with compilation memory being exhausted or the information schema returning too much data.
  • Support tags on sources, columns, and tests, e.g.:
# schema.yml

version: 2

models:
  - name: my_model
    columns:
      - name: id    
        tags: ['my_col_tag']
        tests:
          - relationships:
              to: abc
              field: other_id
              tags: ["my_test_tag"]


# Run tests tagged with a specific tag
$ dbt test -m tag:my_test_tag

# Run all tests on columns tagged with a specific tag
$ dbt test -m tag:my_col_tag
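
For the generate_database_name macro mentioned above, an override in your project would look roughly like this. This is a sketch modeled on the existing generate_schema_name macro; the exact arguments and fallback logic here are assumptions, so adapt them to your own database naming scheme:

{% macro generate_database_name(custom_database_name, node) -%}
    {#- Sketch: use the target's database unless the model configures a custom one -#}
    {%- if custom_database_name is none -%}
        {{ target.database }}
    {%- else -%}
        {{ custom_database_name | trim }}
    {%- endif -%}
{%- endmacro %}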

Docs

Note: These changes are not yet released in the 0.16.0-b3 pre-release, but will be live in dbt 0.16.0-rc1

  • Support hiding models in the documentation website
  • Support meta keys in the documentation website (a sketch of both is shown after the quoting example below)
  • Support documentation on seeds, snapshots, macros, analyses, and custom data tests
  • Support column-level quoting in schema.yml files:
# schema.yml
version: 2

models:
  - name: model
    columns:
      - name: a-field-that-should-be-quoted
        quote: true
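
And for the hiding and meta features above, the schema.yml additions look roughly like this (a sketch; the meta keys and values are arbitrary examples, not required names):

# schema.yml
version: 2

models:
  - name: my_model
    docs:
      show: false        # hide this model in the documentation website
    meta:
      owner: "@alice"    # example meta key rendered in the docs site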

Snowflake

  • Parallelize filling the relation cache at the start of the run (a meaningful performance improvement)

Under the hood

  • Many changes to how dbt builds contexts to make compilation more consistent and scrutable.
    • Rationalize materialization override precedence (previously poorly defined)
    • Rationalize macro override precedence
    • See this PR for a summary of changes
  • Improve/automate dbt deployment flow, so we’ll be deploying many more pre-releases of dbt in the future!

BigQuery

  • Fix database errors when changing the partitioning or clustering configs for an existing table model
  • Support a cost-effective merge operation in incremental models and enhanced partition configs (more on this to come!). Example usage:
{{ config(
  materialized='incremental',
  partition_by={"field": "collector_tstamp", "data_type": "date},
  unique_key="event_id"
) }}

select * from snowplow.event
{% if is_incremental() %}
-- _dbt_max_partition is injected into the query using BQ Scripting
where collector_tstamp > _dbt_max_partition
{% endif %}

Specific questions:

  • Snowflake users: do you find that startup time is faster than before?
  • BigQuery users who previously encountered issues with dbt docs generate - are these issues now fixed?
  • Is anything that was previously working now broken? If so, what?

On dbt run, I’m getting this error for 3 models that have partition configs:

Compilation Error in macro bigquery__get_merge_sql (macros/materializations/merge.sql)
  macro 'dbt_macro__bigquery__get_merge_sql' takes not more than 4 argument(s)

  > in macro adapter_macro (macros/adapters/common.sql)
  > called by macro get_merge_sql (macros/materializations/common/merge.sql)
  > called by macro bq_partition_merge (macros/materializations/incremental.sql)
  > called by macro materialization_incremental_bigquery (macros/materializations/incremental.sql)
  > called by macro bigquery__get_merge_sql (macros/materializations/merge.sql)

I did see the deprecation warnings, but it sounds like dbt would be smart enough to infer the correct partition:

* Deprecation Warning:  As of dbt v0.16.0, the `partition_by` config in BigQuery
accepts a dictionary containing `field` and `data_type`.
 - Provided partition_by: date(session_start_timestamp_pst)
 - dbt inferred: {'field': 'session_start_timestamp_pst', 'data_type':
'timestamp'}
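
For reference, updating the model config to the new dictionary form (using the field and data type from the inferred value in the warning above) would look something like this:

{{ config(
  materialized='incremental',
  partition_by={"field": "session_start_timestamp_pst", "data_type": "timestamp"}
) }}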

Thanks for the report @claus! Do you know if you have a custom override of bigquery__get_merge_sql in your project? We removed this macro from the dbt-bigquery plugin, so I think that if you’re still seeing this error, it’s because it lives in your project. Let me know if that’s not the case!
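
One quick way to check is to search your project for the macro name (a shell one-liner, assuming your macros live under macros/):

$ grep -rn "bigquery__get_merge_sql" macros/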

@drew definitely didn’t override merge in this or any project… I also just checked that dbt clean doesn’t help here.
Anything else I can do to help debug this?

Confirmed that I did install the latest dbt-bigquery

Attempting uninstall: dbt-bigquery
    Found existing installation: dbt-bigquery 0.15.2
    Uninstalling dbt-bigquery-0.15.2:
      Successfully uninstalled dbt-bigquery-0.15.2
...
Successfully installed cffi-1.13.2 dbt-0.16.0b3 dbt-bigquery-0.16.0b3 ...

dbt v0.16.0-rc2

The second release candidate for dbt v0.16.0 is now available on PyPI, Homebrew, and in dbt Cloud.

Installation

# With Homebrew
brew install dbt@0.16.0-rc2
brew link --overwrite dbt@0.16.0-rc2

# With pip
pip install dbt==0.16.0rc2

Changelog from the last beta:

  • Fix an issue with the pip deployment on BigQuery (thanks for the heads up @claus!)
  • Pin cffi to <1.14 to avoid a version conflict with snowflake-connector-python
  • Remove the requirement to have a passphrase when using Snowflake key pair authentication
  • Changes to the docs website (screenshots below)
  • Support an insert_overwrite strategy for BigQuery incremental models :tada:
  • Use show terse schemas on Snowflake to make dbt bootstrapping faster!
  • Parallelize filling the Snowflake relation cache (speedup)

Docs changes

  • Render docs for macros
  • Render docs for analyses
  • Render docs for custom data tests
  • Support hidden models (not shown in the file tree, but visible in the DAG)
  • Render model meta tags in model details


@drew
This worked on BQ now!

However, some anecdotal comparisons here show, at least for this table, a slowdown in incremental update processing time, and an increase in bytes processed.

This could be a consequence of this table not being that large, so it may get better as we scale, but maybe there is also something wrong in our approach.

  • We always process the last 3 days, to catch late-arriving event data
  • The table is partitioned on session_start_timestamp, which is what we use to filter in the incremental where clause (sketched below)
Table size:            9.27 GB
Number of rows:        3,740,797
Partitioned by:        Day
Partitioned on field:  session_start_timestamp
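
The incremental filter is roughly of this shape (a sketch; the exact predicate in the real model may differ):

{% if is_incremental() %}
  -- reprocess the last 3 days to pick up late-arriving events
  where session_start_timestamp > timestamp_sub(current_timestamp(), interval 3 day)
{% endif %}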

The new BQ script looks good to me; it seems to correctly merge on both the unique_key and the partition field.

set (dbt_partitions_for_upsert) = (
      select as struct
          array_agg(distinct date(session_start_timestamp))
      from my_project.my_schema.events__dbt_tmp
  );
  
  -- 3. run the merge statement
  
  
merge into my_project.my_schema.events as DBT_INTERNAL_DEST
        using (
    select * from my_project.my_schema.events__dbt_tmp
  ) as DBT_INTERNAL_SOURCE
        on date(DBT_INTERNAL_DEST.session_start_timestamp) in unnest(dbt_partitions_for_upsert) and 
            DBT_INTERNAL_SOURCE.event_id = DBT_INTERNAL_DEST.event_id

dbt 0.15.2

dbt run --target dev -m events
Running with dbt=0.15.2
...
06:41:13 | 1 of 1 OK created incremental model my_schema.events [MERGE (514084) in 66.85s]
Duration:         1 min 4 sec
Bytes processed:  2.87 GB
Bytes billed:     2.87 GB

dbt 0.16.0 rc2

dbt run --target dev -m events
Running with dbt=0.16.0-rc2
...
06:43:54 | 1 of 1 OK created incremental model my_schema.events [SCRIPT (4.0 GB processed) in 93.52s]
Duration:         1 min 31 sec
Bytes processed:  4.05 GB
Bytes billed:     4.06 GB

Any ideas?

Ha, @claus that is… not what we want. I can do some digging to see why that would be the case. I can buy that this approach is slower (we run more queries) but I’m shocked that more bytes are processed! I would have thought it would be strictly less data scanned in these queries. Can you share the config for this model + a representative example of its incremental predicate?


Can you also try using the (undocumented and beta-ish) insert_overwrite approach outlined here: Feature/bq incremental strategy insert_overwrite by jtcohen6 · Pull Request #2153 · dbt-labs/dbt-core · GitHub. By setting the incremental_strategy to insert_overwrite, dbt will ignore the unique_key (you don’t need to provide one) and replace records in the destination table based on the partitioning config. Definitely do this in dev, as it’s destructive and not well documented at the moment (but that will change soon!)
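
A minimal config for that approach would look roughly like this (a sketch; the partition_by values mirror the model discussed above):

{{ config(
  materialized='incremental',
  incremental_strategy='insert_overwrite',
  partition_by={"field": "session_start_timestamp", "data_type": "timestamp"}
) }}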

cc @jerco

Thanks @drew! I have a hunch that, at this scale, the overhead of running the additional queries far outweighs the benefits of the new merge strategy. I will reach out on Slack re: config details.
I’ll also check out the insert overwrite approach this weekend and will let you know.

For those following along at home, @claus and I managed to get to the bottom of this, and I wrote up an issue here. Really exciting stuff! BigQuery is smarter than we are, sometimes :slight_smile:


On dbt run against an already-existing, relatively plain set of output tables, dbt is throwing the same error for every output: "Table object has no attribute range_partitioning". Any ideas?

Target = BigQuery
Version = 0.16.0-rc2

Hey @warpraptor - I played around with this today and I wasn’t able to replicate the behavior that you’re seeing. Is there any chance you may have an out-of-date version of google-cloud-bigquery? You can check with:

pip freeze | grep google-cloud

I see:

google-cloud-bigquery==1.24.0
google-cloud-core==1.3.0

Which versions are you seeing here?

google-cloud-bigquery==1.20.0
google-cloud-bigtable==1.0.0
google-cloud-core==1.0.3
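
The upgrade itself is just a pip command (something like this, depending on your environment):

pip install --upgrade google-cloud-bigquery google-cloud-core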

I upgraded google-cloud-core and google-cloud-bigquery and now…


Thanks!!

Thanks for the update! We’re going to try revving that google-cloud-bigquery dependency for the 0.16.0 release - thanks for bringing it to our attention! https://github.com/fishtown-analytics/dbt/issues/2213