Analyzing Fishtown's dbt project performance with artifacts

Heads up:

We’re about to describe a project we took on that is an MVP, and is only available to Snowflake users. You may find it interesting if you’re using a different warehouse, but wanted to let you know early before I disappoint you!

This article also assumes you’re familiar with:

A brief history of analyzing project performance

Many years ago, we realized that some of the best teams wanted a way to monitor their dbt run times to answer questions like:

  • What are my slowest-running models?
  • Is my total run time increasing over time? Did someone add a model a month ago that caused our whole run to slow down?
  • What is my test success rate over time?

In essence, these teams want to perform analytics on their analytics (I like calling this “metalytics”), with the goal of using this information to optimize their build times.

In mid-2018, we came up with an idea:

:bulb: What if we inserted a record into a table every time a model starts and completes?

It seemed pretty clever, so we published the event-logging package to achieve this. Unfortunately, adding these insert statements slowed down a dbt project significantly — if the goal of implementing monitoring is to give information to a team so they can optimize a dbt project, then we’ve already just taken a huge step backwards when implementing our monitoring.

Despite slapping a “:warning: WARNING!” label on the repo’s readme, this package continues to be our fifth-most popular dbt package, with 1 in 40 projects using it. To us, this indicates that we needed to find a better way. We tried using the results object instead, but could never get it to quite work the way we wanted to, and still felt like it wasn’t the right answer.

A quick detour: dbt artifacts

Every time you run a dbt command, dbt creates a number of files in your target directory to describe the output of that command — these files are referred to as “artifacts”. In particular, dbt creates:

  • manifest.json: a full representation of your dbt project’s resources (models, tests, macros, etc), including all node configurations and resource properties.
  • run_results.json: information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed.
  • sources.json: information about source freshness checks
  • catalog.json: information from your data warehouse about the tables and views produced and defined by the resources in your project.

(Check out the docs for more information on which commands produce each file, as well as more information about the contents).

Based on the above descriptions, we can tell that these files contain valuable information about the project timing, model dependencies and even the results of test commands. So, naturally, some curious community members have had the following light bulb moment:

:bulb: What if we analyzed the artifacts to report on the performance of a dbt project?

As long as we:

  1. can get these files into our data warehouse, AND
  2. are using a warehouse that handles JSON natively

Then we should be good to go!

Some of our community members have been blazing the trail on this front, and have shared their findings with the community — you can read some prior art here:

This felt like a super promising solution to us, but there were a few things in our way from implementing this in our own project:

  • We use dbt Cloud for orchestrating our dbt runs, and there was no way to get these files out of dbt Cloud and into our data warehouse (or even into cloud storage). The above teams above are using other orchestration tools.
  • We had some work on the roadmap that felt related to this project (more on this below)
  • This wasn’t a priority for our own data team! We spend so much time working on other team’s analytics, that sometimes we neglect our own. We just hadn’t felt that reporting on our project run times was a priority when we still had a ton of models to build out! (I call this “the expert’s paradox” — I often wonder what the Looker instance at Looker looks like, or the HubSpot implementation at HubSpot)

Getting in on the action

A few various threads coalesce-d that meant we were ready to analyze our own project performance:

  1. In v0.19.0 we created a stable schema¹ of these files with a versioning system — previously we had changed these files whenever we needed to, and often introduced breaking changes if someone was relying on them. I guess we’re growing up ;’)
  2. The dbt_artifacts package I mentioned above gave everyone in the community (ourselves included) a huge head start on analyzing these files, especially if you’re a Snowflake user. Huge thank you to Niall Woodward!
  3. The Snaptravel team let us know about the put command in Snowflake, which allows you to upload local files to Snowflake — this was the real game changer; prior to this, we would need to wait for our engineering team to build a mechanism (a metadata API? :eyes::eyes:) to get these files out of dbt Cloud. Thank you so much to Kevin Chan and Jonathan Talmi for sharing your work with us!

¹Here, we mean “schema” in the sense of “the underlying structure of the data”. This can be a confusing term if your point of reference for the word “schema” is “the folder that my views and tables live in within my database”. As always — naming things is hard.

How did we get these files into Snowflake?

Note: if you’re using an orchestrator other than dbt Cloud, you might choose to instead achieve this via a file upload step to S3, in which case you can skip this section.

0. Understanding the put command

The put command allows you to upload files from a local directory to a Snowflake stage:

put file://target/run_results.json @analytics.dbt_artifacts.artifacts auto_compress=true;

This snippet of SQL creates a stage (named analytics.dbt_artifacts.artifacts), uploads a file to the stage, and then copies that file from the stage into a table (our convention is to give stages and tables the same name)

Expand me!
-- create the schema
create schema if not exists analytics.dbt_artifacts;

-- create the stage
create stage if not exists analytics.dbt_artifacts.artifacts file_format = ( type = json );

-- upload the file
put file://target/run_results.json @analytics.dbt_artifacts.artifacts auto_compress=true;

-- create the table
create table if not exists analytics.dbt_artifacts.artifacts (
    data variant,
    generated_at timestamp,
    path string,
    artifact_type string
);

-- copy the file from the stage
copy into analytics.dbt_artifacts.artifacts from (
    select
        $1 as data,
        $1:metadata:generated_at::timestamp_ntz as generated_at,
        metadata$filename as path,
        regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type
    from @analytics.dbt_artifacts.artifacts
)
file_format=(type='JSON')
on_error='skip_file';

-- clear out the stage
remove @analytics.dbt_artifacts.artifacts pattern='.*.json.gz'

Take a moment to make sure you follow this SQL :slight_smile:

Is there an equivalent command for Redshift and BigQuery?
Unfortunately, no. Both of these warehouses can only load data from cloud storage rather than locally. This is outside of our control.

1. Turning the SQL into a macro

Now that we know the SQL we need to run, we can parameterize it into macros! We created two macros:

  • create_artifact_resources: this macro only needs to be run once, to do all the initial setup on Snowflake
  • upload_artifacts: run this macro every time you want to upload files to Snowflake

More details here:

`macros/create_artifact_resources.sql`
{% macro create_artifact_resources() %}

{% set src_dbt_artifacts = source('dbt_artifacts', 'artifacts') %}

{{ create_schema( src_dbt_artifacts ) }}

{% set create_stage_query %}
create stage if not exists {{ src_dbt_artifacts }}
file_format = ( type =  json );
{% endset %}

{% set create_table_query %}
create table if not exists {{ src_dbt_artifacts }} (
    data variant,
    generated_at timestamp,
    path string,
    artifact_type string
);

{% endset %}


{% do log("Creating Stage: " ~ create_stage_query, info=True) %}
{% do run_query(create_stage_query) %}

{% do log("Creating Table: " ~ create_table_query, info=True) %}
{% do run_query(create_table_query) %}

{% endmacro %}
`macros/upload_dbt_artifacts.sql`
{% macro upload_dbt_artifacts(filenames) %}

{% set src_dbt_artifacts = source('dbt_artifacts', 'artifacts') %}

{% set remove_query %}
    remove @{{ src_dbt_artifacts }} pattern='.*.json.gz';
{% endset %}

{% do log("Clearing existing files from Stage: " ~ remove_query, info=True) %}
{% do run_query(remove_query) %}

{% for filename in filenames %}

    {% set file = filename ~ '.json' %}

    {% set put_query %}
        put file://target/{{ file }} @{{ src_dbt_artifacts }} auto_compress=true;
    {% endset %}

    {% do log("Uploading " ~ file ~ " to Stage: " ~ put_query, info=True) %}
    {% do run_query(put_query) %}

    {% set copy_query %}
        begin;
        copy into {{ src_dbt_artifacts }} from
            (
                select
                $1 as data,
                $1:metadata:generated_at::timestamp_ntz as generated_at,
                metadata$filename as path,
                regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type
                from  @{{ src_dbt_artifacts }}
            )
            file_format=(type='JSON')
            on_error='skip_file';
        commit;
    {% endset %}

    {% do log("Copying " ~ file ~ " from Stage: " ~ copy_query, info=True) %}
    {% do run_query(copy_query) %}

{% endfor %}

{% do log("Clearing new files from Stage: " ~ remove_query, info=True) %}
{% do run_query(remove_query) %}

{% endmacro %}

2. Uploading the files as part of our scheduled jobs

Now that we’ve templated our code into macros, we need a way to instantiate the macro after each step — of note is the fact that our production job has a number of different steps:

$ dbt seed --full-refresh
$ dbt snapshot
$ dbt run
$ dbt test
$ dbt source snapshot-freshness

We need to upload artifacts for each step, since we overwrite the artifact files every time we execute a step. An on-run-end hook felt like they might be the right answer however we realized that these files are still being written when the hook gets invoked.

Instead, we set up an operation after every single step to invoke the upload_dbt_artifacts macro.

$ dbt seed --full-refresh
$ dbt run-operation upload_dbt_artifacts --args '{filenames: [manifest, run_results]}'
$ dbt snapshot
$ dbt run-operation upload_dbt_artifacts --args '{filenames: [manifest, run_results]}'
$ dbt run
$ dbt run-operation upload_dbt_artifacts --args '{filenames: [manifest, run_results]}'
$ dbt test
$ dbt run-operation upload_dbt_artifacts --args '{filenames: [run_results]}'
$ dbt source snapshot-freshness
$ dbt run-operation upload_dbt_artifacts --args '{filenames: [sources]}'

4. Contributing the macros back to the dbt_artifacts package

We wanted to be good citizens here, so I took the time to contribute these macros back to the dbt_artifacts package! If you want to use these macros, try installing the package rather than copying the above code into your project.

Modeling our project performance

This was pretty easy given the head start that Niall had provided us — we simply installed the package to our project and the hard work was done! We also added some new models that we found useful ourselves

In particular, this package provided us with the following models

  • dim_dbt__models
  • fct_dbt__model_executions
  • fct_dbt__latest_full_model_executions
  • fct_dbt__critical_path
  • fct_dbt__run_results

I’ll add a link to the project docs for this package shortly!

Analyzing our project performance

Now that we have our models, it was time to experiment with ways to use this information. Here’s what we’ve done so far — all the visualizations were done in Looker, but a relatively simple queries on top of our modelled data.

1. Monitoring project run time over time

This visualization shows the duration of a “dbt run” step over time (excluding our full-refresh runs). You can see that our “dbt run” duration is pretty stable, though we’ve only got a couple of weeks’ of data.

When working on this viz, I realized that I wanted to report on the total “job” time (i.e. add together the timings for each step in our job, potentially displaying it as an area chart).

We have the data to do this — each dbt Cloud job has a number of environment variables that track the “run_id”, and these get passed through to the manifest.json. Something for us to tinker with in the future!

2. Identifying bottlenecks

The idea for this viz came from Drew. As dbt works through your project, it allocates workers (or threads) to each model. If we create a Ganntt chart, with the thread ID on the y-axis, and the model start and finish time on the x-axis, we can easily identify which models are causing “bottlenecks”.


We had no idea that one model was bricking our dbt run for almost half of the build time!

A really great way to visually display this information, and probably my favourite of the visualizations so far.

3. Test failure/warn rate

We are pretty good at triaging any errors in our dbt project (the notifications come into a dedicated #internal-analytics-status channel, and our Head of Data checks what’s failing and assigns a team member to fixing the error — usually it gets resolved the same day.)

So I was expecting our failure rate to be pretty low. Turns out, it’s zero:

Any guesses why?

click here to find out

If the dbt test command fails, we exit the run, so don’t get to the run-operation step that uploads the files. 0% error rate, baby!

4. Source freshness waterline

Here’s ours for our Slack source tables — nothing super interesting just yet, but something that was fun to explore.

Should the axis be inverted here, with 0 at the top and negative numbers? Is this the right viz type? Should I take a course on effective data visualization? All good things to think about.

What did we learn?

Keep the end goal in mind:

In adding this monitoring to our project, we wanted to understand how we could optimize our project performance. But by adding all these run-operations for monitoring, we added 2m30s to each run. Any performance improvements we make will have to be better than that! So I’m not entirely sure this was worth it :sweat_smile:

In working through this I went through various stages of optimism and skepticism of the value of this data. While this is really cool data, and a very fun project, I always try to relate my work back to a business goal — optimizing our own project performance didn’t feel as valuable as where I might otherwise be spending my time if I were on our data team.

Overall, I felt this was nice to know, but hard to act on (except for bottleneck viz, that’s rad! h/t Drew). I’d say that given the current complexity of undertaking this project, this kind of project should be one that only the most mature data teams takes on, and only when there’s serious project performance issues (change my view in the comments!). Of course, if we somehow made this project easier, and say, integrated into our product, it would open this kind of reporting up to all data teams!

The “put” command, and run-operations only provide a half-solution

There needs to be a way for all of our users to get access to this data reliably (not just Snowflake users, and not just in a way that only gets uploaded if a step is successful)

Further, having this analysis live in your BI tool is probably one step away from where you want it to live — a lot of this information is going to be more useful contextually, for example, if you wanted to see the historical runtimes of a particular model straight in your dbt project docs, or overlay the “bottleneck” viz on top of your current dbt DAG.

A very cool model for MVP functionality

It was really fun to keep an eye on the community, and build our own version based on what we had learned from others, and then be community members ourselves! If you attended Coalesce last year, it’s no secret that we’re thinking about metadata a lot, and this was a great opportunity to explore the value of this data in our own context!

What’s next?

As a community member, there’s a few things I would want to tackle next (in all honesty, I won’t get the opportunity to work on most of these, but wanted to get your creative juices flowing):

  • Some of the above visualizations use models that I’ve built on a branch of the artifacts package, it’s on me to contribute these back :slight_smile:
  • Can we join this information to Snowflake costs? Downstream usage? Etc
  • Can we contribute some changes to make the SQL multi-warehouse compatible?

As the makers of dbt, there’s even more we’re thinking about on this front.

In the near-term(-ish), we’re thinking about how to make the “upload files as a run-operation” approach less janky — for example, we’re looking into copying each step’s subfolders to a target/step_1 folder, so that we have one upload step at the end of a job, rather than an upload step after each step.

In the medium-term, we want to make it easier for dbt Cloud users of all warehouses to get this raw data into their warehouse, in a way that doesn’t require adding operations to each step! More to come on this front.

We’re also thinking about how we can leverage this data in our Cloud product — we feel that so much of this information is going to be more powerful in the context of your dbt project, rather than rendering it in your BI tool. If we can build this into our product, using this information to understand your dbt project is no longer a project that only mature data teams can take on, but is instead one that all data teams using dbt can leverage.

How can you contribute?

Help us build our body of knowledge by adding to this post in the comments section:

  • How are you using the metadata about your dbt runs on your data team? Do you have any “killer” visualizations?
  • How has having access to this data changed the projects your team works on?

Finally, if you are taking on a similar project, I recommend that you use Niall’s package and contribute back to it!


Thanks to @jerco for pairing with me on this project! And all the community members whose work we used: @emilyhawkins, @NiallRees, Kevin Chan, @jt_st

15 Likes

Thanks for the writeup and analysis ideas @claire! My esteemed colleague Ken Payne has a soon to be released dbt package which combines an also soon to be released Tableau metadata singer tap with the artifacts package to identify Tableau-dbt dependencies. Really cool stuff.

4 Likes

A few various thread s coalesce -d

Claire. :neutral_face:

This is really cool! I’d wondered how you were going to get the files from dbt Cloud without knowing the intricacies of the runner’s location on disk. Delighted to see that it’s just a relative reference! Good job Snowflake.

1 Like

Sweet writeup! FYI for folks running this outside of dbt cloud, we found the following bash commands useful:

dbt_run_commands = [f"""dbt run -m {dbt_models}; ret=$?;
                        dbt run-operation upload_dbt_artifacts --args '{{table: run_results}}';
                        exit $ret"""]

dbt_test_commands = [f"""dbt test -m {dbt_models}; ret=$?;
                         dbt run-operation upload_dbt_artifacts --args '{{table: test_run_results}}'; 
                         exit $ret"""]

Which allows us to upload artifacts even in the case of model/test failure.

3 Likes

Ha! classy. gonna have to use that myself… :wink:

@emilyhawkins’ session I think was my favourite amongst many at Coalesce. I’m looking to implement the above. I want users to see for themselves the freshness and test state of the data.

We’re an Azure shop, so we have Azure Storage, Snowflake on Azure, and use Power BI.

@claire how did you generate that Ganntt chart? Is it just a native Looker viz or is there some inspiration you can share?

Native Looker chart! It’s not one of their main ones though, which is why the time axis shows the date rather than the hour (it wasn’t customizable unfortunately)