Release: v0.20.0 (Margaret Mead)

jerco · June 9, 2021, 1:35pm

Updates

[Sep 8] v0.20.2, a patch release with bug fixes, is now available.
[Aug 11] v0.20.1, a patch release with bug fixes, is now available.
[Jul 12] v0.20.0 (final) is now available.
[Jun 30] v0.20.0-rc2 is available.
[Jun 04] v0.20.0-rc1 is available for prerelease testing.

Who is Margaret Mead? Check out the release notes for a biography of this famous Philadelphian.

dbt v0.20.0 (Margaret Mead) is now available on PyPI, Homebrew, DockerHub, and dbt Cloud. The two biggest areas of focus are Tests and Performance, which I’ll discuss below. There’s lots more in this release, though, so I encourage you to read:

Changelog for the full set of features, fixes, and under-the-hood tweaks
Migration guide for an overview of new and changed documentation

Installation

# with pip, install a specific adapter
pip install --upgrade dbt-<adapter>

# with Homebrew, install the four original adapters
brew upgrade dbt

A few notes:

If you’re installing from PyPI, we recommend specifying your adapter as dbt-<adapter> (e.g. dbt-postgres). This way, you install just what you need and avoid any dependencies you don’t. If you’re installing from Homebrew: we haven’t yet built a separate formula for each adapter, but we plan to in the future.
dbt-core==0.20.0rc1 includes a new dependency, tree-sitter. (See the experimental parser section below.) This requires a C compiler, such as GCC, to successfully install. We’re working to remove this requirement ahead of the final release. Update: We removed this requirement in 0.20.0rc2.

Breaking changes

Note that this release includes breaking changes for:

Custom generic (schema) tests. All test queries should return a set of rows, rather than a single numeric value. In most cases, this is as simple as switching select count(*) to select *.
Users and maintainers of packages that leverage adapter.dispatch(). See docs for full details.
Artifacts: manifest.json and run_results.json are now using a v2 schema.
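
To make the first breaking change concrete, here's a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table, columns, and queries are made up for illustration. The point is only that a new-style test query returns the failing rows themselves, and the failure count is the number of rows that come back.

```python
import sqlite3

# Hypothetical data: an orders table with one null customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, customer_id integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(1, 10), (2, None), (3, 12)],
)

# Old style (before v0.20): the test query returns a single numeric value.
old_style_sql = "select count(*) from orders where customer_id is null"

# New style (v0.20 onward): the test query returns the failing rows.
new_style_sql = "select * from orders where customer_id is null"

old_result = conn.execute(old_style_sql).fetchone()[0]
failing_rows = conn.execute(new_style_sql).fetchall()

# Counting the returned rows recovers the number the old query produced.
failure_count = len(failing_rows)
```

Returning rows instead of a count is also what makes it possible to store the failing records for later inspection.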

Tests

We’ve written previously about all the exciting directions community members are going with dbt’s testing functionality. I’ve seen frameworks for unit testing, regression testing, you-name-it testing. There’s so much that you can do with dbt tests: they’re just macros; they’re just SQL.

At the same time, tests have been finicky and unintuitive. They’re a critical part of dbt—and we want to go far with them—so, for now, we’re securing their foundations. We’re looking to release dbt v1.0 later this year, and bringing tests up to parity is one of our highest priorities ahead of dbt’s first major-version release.

In v0.20, tests will:

Be more consistent between their one-off (“data test”) and generic (“schema test”) implementations, where the latter is just a reusable version of the former
Execute via a 'test' materialization, rather than mysterious Python code
Be configurable from dbt_project.yml, including the ability to set default severity, or disable tests from packages
Support a number of new configurations, all out of the box, including:
- where filters on the underlying model, seed, snapshot, or source being tested
- warn_if and error_if conditional thresholds, based on the number of failures
Store failing records in the database for easy development-time debugging, if that’s something you want

There are things we didn’t get to, which I want to call out because they’re still great ideas. We may still take a swing at these ahead of releasing dbt v1.0 later this year:

Supporting plain-language descriptions for tests. This intersected with the performance improvements in a way that meant we couldn’t do both simultaneously. I still want to get to a place where a failing unique test on the id column in the customers table returns a sentence like: Found 5 duplicated values of customers.id, erroring because 5 != 0.
Better FQNs, to make it easier to configure an individual test from dbt_project.yml—or, say, all tests on a given subfolder of models.
Defining generic test blocks inside the tests/ folder, so that generic and one-off tests cohabitate in harmony. For now, they still need to live in macros/.
Renaming schema_test and data_test in the codebase. I’ve started calling these generic and bespoke, which feels much more accurate, but those words don’t roll right off the tongue. If you have good ideas, I’d love to hear them!

Performance

We’ve seen that dbt v0.19.1 offers, on average, 3x faster parsing versus v0.19.0. That means projects which used to take 1 minute between typing dbt run and seeing the first model execute are down to 20 seconds. That’s an amazing, hard-won improvement—and it’s still not fast enough. We want projects of all sizes, whether 100 models or 5k models, to start up in fewer than 5 seconds while you’re developing.

To accomplish this, we’ve included two big features in v0.20.0: a top-to-bottom rework of partial parsing, and an experimental parser that can statically analyze the majority of dbt models. Both features are off by default; we encourage you to give them a try, and let us know what you find. For more details, see our fresh new docs on parsing.

Partial parsing rework

Partial parsing is a feature that’s been around for some time—two years, to be precise. If you’ve ever used dbt Cloud’s IDE, you’ve benefitted from partial parsing, even if you didn’t know it at the time.

The premise of partial parsing is simple. In development, you’re probably only editing a handful of files at a time. Rather than reread every file, and rebuild your entire project state from scratch, every time, dbt should re-parse just the files that have changed.
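
The idea can be sketched in a few lines of Python. This is not dbt's actual implementation: the function names are hypothetical, and I'm assuming change detection by content checksum, which is just one plausible way to do it.

```python
import hashlib


def checksum(contents: str) -> str:
    """Return a stable fingerprint of a file's contents."""
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


def files_to_reparse(saved: dict, current_files: dict) -> dict:
    """Compare current file contents against checksums saved by the last run.

    saved:         path -> checksum recorded previously
    current_files: path -> file contents as they are now
    """
    current = {path: checksum(text) for path, text in current_files.items()}
    changed = sorted(p for p in current if p in saved and saved[p] != current[p])
    added = sorted(p for p in current if p not in saved)
    deleted = sorted(p for p in saved if p not in current)
    return {"changed": changed, "added": added, "deleted": deleted}


# Hypothetical project state: b.sql was edited and c.sql is new.
previous = {
    "models/a.sql": checksum("select 1"),
    "models/b.sql": checksum("select 2"),
}
now = {
    "models/a.sql": "select 1",
    "models/b.sql": "select 3",
    "models/c.sql": "select 4",
}
result = files_to_reparse(previous, now)
```

Only the files listed in the result would need to be read and parsed again. Everything else could be reused from the saved state.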

Yet partial parsing has been far from perfect. That’s because there are parts of dbt’s “mise en place” that the old partial parsing just couldn’t help with, such as processing refs and rendering descriptions. Even if no files had changed, partial-parse runs could still take over a minute for some projects.

In dbt v0.20.0, that’s changing. In a project with 5000 files, changing 1 file and re-running with --partial-parse ought to start up in 5 seconds, no more.

Partial parsing still isn’t perfect: We’ve documented a set of known edge cases, where a full re-parse is necessary. We also touched a lot of code to make this possible, and so we’ll need your help testing this extensively. Please, please let us know if you encounter weird bugs or undocumented edge cases.

Experimental parser

dbt leverages a set of special Jinja macros—ref(), source(), and config()—to infer needed information, at parse time, about the properties of a model, its dependencies, and its place in the DAG. Extracting information from those macros has always required a full Jinja render—until today. We’ve coded up a way to statically analyze that information instead.
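
To give a feel for what "statically analyze" means here, this is a toy sketch that pulls ref() and source() calls out of model SQL without rendering any Jinja. It swaps in regular expressions for simplicity, whereas the real experimental parser depends on tree-sitter (see the installation notes). The patterns and function name are my own, and they only cover the simplest call forms with literal string arguments.

```python
import re

# Matches ref('model_name') or ref("model_name"), one-argument form only.
REF_PATTERN = re.compile(r"\bref\(\s*['\"]([^'\"]+)['\"]\s*\)")

# Matches source('source_name', 'table_name').
SOURCE_PATTERN = re.compile(
    r"\bsource\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)"
)


def extract_dependencies(model_sql: str) -> dict:
    """Find ref() and source() calls by pattern matching, with no Jinja render."""
    return {
        "refs": REF_PATTERN.findall(model_sql),
        "sources": SOURCE_PATTERN.findall(model_sql),
    }


# Hypothetical model file contents.
model_sql = """
select * from {{ ref('stg_orders') }}
join {{ source('shop', 'customers') }} using (customer_id)
"""
deps = extract_dependencies(model_sql)
```

Anything fancier than literal arguments (variables, loops, custom macros) would defeat a static approach like this and need a full render.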

For now, the experimental parser only works with models, and models whose Jinja is limited to those three special macros. When it works, it really works: the experimental parser is at least 3x faster than a full Jinja render. Based on testing with data from dbt Cloud, we believe the experimental parser can handle 60% of models in the wild, translating to a 40% speedier model parser on average. We think it will yield at least some benefit in 95% of projects.
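
Here's a back-of-the-envelope check on how those numbers fit together. It assumes every model costs the same to render and that statically parsed models are exactly 3x faster, both simplifications of mine, and it reads "40% speedier" as 40% less parse time.

```python
# Share of models the experimental parser can handle (from the post).
static_share = 0.60

# Speedup for those models relative to a full Jinja render (from the post).
speedup = 3.0

# Time for the whole model parse, relative to rendering everything with Jinja:
# the remaining 40% cost the same, the other 60% cost a third as much.
relative_time = (1 - static_share) + static_share / speedup

# Fraction of model-parsing time saved, roughly 0.4.
time_saved = 1 - relative_time
```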

You can check it out by running dbt parse and dbt --use-experimental-parser parse, and comparing the results in target/perf_info.json.
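
If you'd rather diff the two runs programmatically than eyeball them, something like this could work. It's a rough sketch: I'm assuming you copy target/perf_info.json aside after the first run in case the second run overwrites it, and the field names in the example are made up. It simply compares whatever top-level numeric fields the two files share.

```python
import json
from pathlib import Path


def numeric_fields(data: dict) -> dict:
    """Keep only top-level numeric values (booleans excluded)."""
    return {
        key: value
        for key, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def load_timings(path) -> dict:
    """Read a JSON file and return its top-level numeric fields."""
    return numeric_fields(json.loads(Path(path).read_text()))


def compare(baseline: dict, experimental: dict) -> dict:
    """Return experimental minus baseline for every field both runs share."""
    shared = baseline.keys() & experimental.keys()
    return {key: experimental[key] - baseline[key] for key in sorted(shared)}


# Hypothetical contents, standing in for two saved copies of the file.
baseline = numeric_fields(
    {"elapsed": 10.0, "path_count": 100, "enabled": True, "projects": []}
)
experimental = numeric_fields(
    {"elapsed": 6.5, "path_count": 100, "enabled": False}
)
diff = compare(baseline, experimental)
```

With real files, you would call load_timings() on each saved copy instead of building the dictionaries by hand. Negative values mean the experimental run was faster for that field.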

jaypeedevlin · June 10, 2021, 7:22pm

It seems like the syntax for the experimental parser should be

dbt --use-experimental-parser parse

The provided syntax errored for me.

jerco · June 10, 2021, 8:04pm

You’re totally right! Just fixed above. Thanks for catching

data_ders · June 29, 2021, 1:22am

I spent 20 min looking for more info on how exactly tests can be configured in the dbt_project.yml, and found @kwigley’s PR description for #3392 to be most helpful. I imagine something like it will eventually be added to the Test configuration docs. However, I’m still unclear as to the purpose of where and limit.

tests:
  +warn_if: ">10"
  +error_if: ">100"
  +where: "date_col = current_date"
  +limit: 10
  +fail_calc: "count(*)"

Adds five new test configs:

limit: Simple enough, templated out in the materialization. This is mostly a complement for dbt test --store-failures #3316

where: The way I did this was unbelievably hacky—see the test cases for yourself—but it works, and in a way that should be backwards compatible with all existing schema/generic tests, without them needing to change any part of their SQL definition. (This config doesn’t make sense for one-off tests.)

warn_if, error_if: The user supplies a Python-evaluable string (e.g. >=3, <5, ==0, !=0) and dbt will compare the fail count/calc against it.

By default, set to !=0

Interaction with severity: A little tricky, but I actually think they work reasonably together. By default, severity: error, and dbt checks the error_if condition first; if not error, then check the warn_if condition; if the result meets neither, it passes the test. If the user sets severity: warn, dbt will skip over the error_if condition entirely and jump straight to warn_if.

fail_calc: User-supplied fail_calc for tests #3321
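
The interaction between severity, warn_if, and error_if described above can be sketched like this. It's my own reading of the PR description, not dbt's code: the function names are hypothetical, and I parse the threshold string with an explicit operator table instead of evaluating it as Python.

```python
import operator
import re

# Comparison operators a threshold string may start with.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def meets(condition: str, failures: int) -> bool:
    """Check a threshold string such as '>10' or '!=0' against a failure count."""
    match = re.fullmatch(r"\s*(>=|<=|==|!=|>|<)\s*(\d+)\s*", condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    op, threshold = match.group(1), int(match.group(2))
    return OPS[op](failures, threshold)


def resolve_status(failures, severity="error", warn_if="!=0", error_if="!=0"):
    """Return 'error', 'warn', or 'pass' for a given failure count."""
    # With severity: warn, the error_if condition is skipped entirely.
    if severity == "error" and meets(error_if, failures):
        return "error"
    if meets(warn_if, failures):
        return "warn"
    return "pass"
```

With the thresholds from the snippet above (warn_if ">10", error_if ">100"), 5 failures would pass, 50 would warn, and 500 would error.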

jerco · June 29, 2021, 1:59am

We’ve got some initial docs for where + limit:

Both configs are intended to help tests scale for larger data volumes.

where enables you to filter any model/source/seed/snapshot being tested, even if the generic test query doesn’t include a where statement in its definition. This can be useful if you have a large incremental model, perhaps clustered or partitioned, and you only want to test uniqueness or check for null values on the past few days of data.

limit caps the number of failures that will be returned by a given test query. When storing test failures, this config can safeguard against accidentally writing thousands (or millions) of rows, when you might only need a sample of 500 to detect a fan-out or find the root cause of unacceptable values.
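
A rough sketch of how the two configs shape a test query, again using sqlite3 as a stand-in warehouse. The helper and the table are hypothetical. Wrapping the tested relation in a filtered subquery is an assumption about one way to apply where without touching the test's own SQL, not a description of dbt internals.

```python
import sqlite3


def build_test_query(relation, test_condition, where=None, limit=None):
    """Build a row-returning test query with optional where and limit."""
    # Assumption: `where` swaps the tested relation for a filtered subquery,
    # so the test's own SQL stays unchanged.
    if where:
        target = f"(select * from {relation} where {where}) filtered"
    else:
        target = relation
    sql = f"select * from {target} where {test_condition}"
    if limit is not None:
        sql += f" limit {int(limit)}"
    return sql


# Hypothetical data: three null ids, two of them loaded on the latest day.
conn = sqlite3.connect(":memory:")
conn.execute("create table events (id integer, loaded_on text)")
conn.executemany(
    "insert into events values (?, ?)",
    [
        (None, "2021-06-01"),
        (None, "2021-06-29"),
        (None, "2021-06-29"),
        (4, "2021-06-29"),
    ],
)

recent = "loaded_on = '2021-06-29'"
unfiltered = conn.execute(build_test_query("events", "id is null")).fetchall()
recent_only = conn.execute(
    build_test_query("events", "id is null", where=recent)
).fetchall()
capped = conn.execute(
    build_test_query("events", "id is null", where=recent, limit=1)
).fetchall()
```

Here the not-null check finds three failures with no filter, two once where restricts it to the latest day, and one once limit caps the rows returned.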

data_ders · June 29, 2021, 9:37pm

missed that section of the docs somehow, my bad! in case someone else ever has the same issue as me, I opened a UX issue on the docs repo: fishtown-analytics/docs.getdbt.com#714

p.s. happy discourse cake day to me lol
