Best practices for frequent batching

For those that batch hourly or more frequently, what problems have you run into and how did you overcome them?

Curious to learn about best practices and how far people have pushed batching with dbt.


Per Ben Edwards’ excellent advice, you can write a custom dbt materialization to minimize boilerplate. Let’s leave out implementation details and instead talk about the concept, the inputs, and the output.

If you can maintain a clear, monotonically increasing primary key column with a consistent column name (let’s call it primary_key for the moment), you can embed strong assumptions as code in a custom materialization. If your compiled SQL should look like

[CREATE TABLE | INSERT INTO] {{incrementally_updated_materialized_view}} AS
SELECT *
FROM {{compiled sql select statement}}
WHERE {{primary_key}} > (SELECT MAX(primary_key) FROM {{incrementally_updated_materialized_view}})

then you can pretty easily template the whole thing as a custom materialization, where all you need to do is specify which of the columns is the primary key as an input to the materialization. If we call the primary key incremental_key and the incrementally updated model this_is_fun, pulling from a table fun.facts, then your uncompiled SQL might look something like

-- /path/to/models/this_is_fun.sql

{{ config(materialized='simple_incremental', incremental_key='created_ts_utc') }}

SELECT created_ts_utc, foo, bar, baz
FROM {{ source('fun', 'facts') }}

which would become

INSERT INTO {{target}}.this_is_fun
SELECT created_ts_utc, foo, bar, baz
FROM fun.facts
WHERE fun.facts.created_ts_utc > (SELECT MAX(created_ts_utc) FROM {{target}}.this_is_fun)

and if the table didn’t exist, it would run a CREATE statement instead. All the standard incremental macro boilerplate can be abstracted away if you’re willing to accept constraints, and you can extend it as needed to handle incrementally updated tables with multiple high-water marks.
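For the curious, here is a minimal sketch of what such a materialization could look like, written against dbt’s {% materialization %} API. The name simple_incremental, the incremental_key config, and the overall structure are illustrative assumptions rather than a drop-in implementation; a real version would also need schema-change handling, transactions, and adapter-specific quoting.

{% materialization simple_incremental, default %}

  {#- illustrative sketch only: config names and structure are assumptions -#}
  {%- set incremental_key = config.require('incremental_key') -%}
  {%- set target_relation = this.incorporate(type='table') -%}
  {%- set existing_relation = adapter.get_relation(
        database=this.database, schema=this.schema, identifier=this.identifier) -%}

  {{ run_hooks(pre_hooks) }}

  {% if existing_relation is none %}
    {#- first run: create the table from the full select -#}
    {% call statement('main') -%}
      CREATE TABLE {{ target_relation }} AS (
        {{ sql }}
      )
    {%- endcall %}
  {% else %}
    {#- subsequent runs: insert only rows above the current high-water mark -#}
    {% call statement('main') -%}
      INSERT INTO {{ target_relation }}
      SELECT * FROM ({{ sql }}) AS new_rows
      WHERE new_rows.{{ incremental_key }} >
            (SELECT MAX({{ incremental_key }}) FROM {{ target_relation }})
    {%- endcall %}
  {% endif %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}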

Here is a simple one – have the job time out if it executes for longer than you think it should. This will prevent cascading failures from queued jobs if a single query hangs. dbt Cloud lets you configure this out of the box.

Ok, finally found time to write this.

It is very possible to write super efficient dbt projects. I would go as far as to say that while dbt’s first design goal is to make your engineering workflow easier, its second is to make it faster to run. Recently my team scaled out our dbt-based data pipeline to work across 90 pilot tenants concurrently, executing every 15 minutes. I’ll let you do the math on how many executions that is. Thanks to the tooling that dbt provides, we were able to get the average execution time down to under 2 minutes.

This is the result of months of development, and there are quite a few things we learned. Most important is to realise the tradeoff you are making. Beyond a certain point, performance improvements come at the cost of increased complexity: complexity for the data analysts/engineers maintaining your models, and complexity for the ops team maintaining your batching process. This complexity needs to be managed and mitigated to keep your code/pipeline maintainable. So here are some tips for speeding up your code, and suggestions for keeping it maintainable.

Only do work when you need to

Fewer models to run is always going to be faster. If your EL tool can tell you which models have new data since the last time you started a batch, you can use that to build up a list of models to pass into dbt with dbt run --models xxx+ yyy+ zzz+. How effective this is depends on your sources and how frequently they update; it also depends on how you are combining those sources into models.

Placing views in front of your source tables is also very useful (and can help with basic data cleaning). Views for models are also good, but be wary of the potential performance impact of views being used in downstream models.
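As a hypothetical example of such a view, a thin staging model over the fun.facts source used earlier might just rename and lightly clean columns (all names here are illustrative):

-- models/staging/stg_fun_facts.sql

{{ config(materialized='view') }}

SELECT
    id               AS fact_id,
    TRIM(LOWER(foo)) AS foo,
    bar,
    baz,
    created_ts_utc
FROM {{ source('fun', 'facts') }}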

Calling dbt with a filter on models sounds easy enough, and it has the benefit of not impacting your data analyst’s workflow in any way. But it does introduce some larger concerns for your overall pipeline:

  • Your EL process now needs to track state
  • dbt needs to be told about EL’s state
  • dbt needs its own state to tell what it last processed
  • What about when models change? Guess we need to check for that and FULL_REFRESH those models when they do.
  • This is a lot of state to track, probably needs to be fault tolerant as well.

None of these concerns can (or should) be solved by anything dbt offers. And some of them should be addressed regardless of how frequently you run batches, as it is just good sense for automation.

Faster models

Incrementals

materialized="incremental" is your friend. Incremental is the tool dbt offers to make fast materializations possible. Embrace its power, but you must also understand it. You should be very comfortable with how incrementals work, how your models are executed differently on full refreshes vs incremental runs, and know how you are handling data deletion if that is a thing in your system (in this age of GDPR, it’s hard for deletion to not be required). Having said that, try to replace all table materializations with incrementals.

Incrementals add complexity to your dbt project, as suddenly all your models require branching logic that changes the final sql. It won’t take you long to realise that this code is very similar in almost all your models. It becomes boilerplate, code you can’t live without, mostly copy/paste, but disastrous if it has a mistake in it. At this point you should try to make the most of conventions in your code.

Conventions / keeping things DRY

Incremental updates work by tracking some column in the referenced models and seeing if it is newer than the newest value in the existing model. These versioning columns should be consistent throughout your project. Same name, same datatype, same behaviour. Doing this means that your incremental logic will always look the same, and it allows you to macroify some common patterns in your models.

If most models have the same pattern for incremental logic, then this should be your next target for conventions. Try writing a macro that generates the incremental conditions for you. @drop_database_public has gone into how this can be done above, and you can look here for an example that I use.
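As one possible shape for such a macro (the name and default column are assumptions, not a standard dbt macro), the boilerplate condition can collapse into something like:

-- macros/incremental_filter.sql

{% macro incremental_filter(version_column='created_ts_utc') %}
  {% if is_incremental() %}
    WHERE {{ version_column }} > (SELECT MAX({{ version_column }}) FROM {{ this }})
  {% endif %}
{% endmacro %}

A model can then simply end with FROM {{ ref('stg_fun_facts') }} {{ incremental_filter() }} and pick up the same versioning convention everywhere.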

These conventions won’t work for all scenarios (remember the 80/20 rule?). And the regular incremental model is still there if you need it. But you should be manually implementing materialization logic by exception, not by convention.

Ephemeral vs work table

Ephemeral models exist to allow you to split complex models into easy-to-understand chunks. When going for fast batches, though, you should consider replacing them with work tables that are materialized incrementally. This can be a big improvement to the incremental performance of complicated models. Use as many work models as makes sense for your model. This all comes at the cost of storage; hopefully that isn’t a huge concern.
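A work model is just a normal model with its own (usually incremental) materialization, so the main change from the ephemeral version is the config line. Purely as an illustration, with made-up model names:

-- models/work/wrk_fun_enriched.sql
-- previously materialized='ephemeral'

{{ config(materialized='incremental', unique_key='fact_id') }}

SELECT f.fact_id, f.foo, d.dimension_value, f.created_ts_utc
FROM {{ ref('stg_fun_facts') }} AS f
JOIN {{ ref('dim_foo') }} AS d ON d.foo = f.foo

{% if is_incremental() %}
  WHERE f.created_ts_utc > (SELECT MAX(created_ts_utc) FROM {{ this }})
{% endif %}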

Measuring performance

dbt has a wealth of data about its runs that it doesn’t expose via the command line. Even when it does, you want a way to measure across all your runs. Odds are, if you are reading this, you have a data warehouse, so let’s use it. Use dbt’s on-run-start and on-run-end hooks to record when a batch started and finished, and with what parameters. dbt has a very detailed results object; dump all of it to your data warehouse.
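One way to capture this is an on-run-end hook in dbt_project.yml, e.g. on-run-end: ["{{ log_run_results(results) }}"], paired with a macro like the sketch below. The dbt_run_results audit table, the macro name, and the exact result attributes are assumptions here — the attributes on result objects have changed between dbt versions, so check them against your version.

-- macros/log_run_results.sql

{% macro log_run_results(results) %}
  {% if execute %}
    {% for res in results %}
      {% set insert_sql %}
        INSERT INTO {{ target.schema }}.dbt_run_results
            (invocation_id, node_id, status, execution_time, completed_at)
        VALUES (
            '{{ invocation_id }}',
            '{{ res.node.unique_id }}',
            '{{ res.status }}',
            {{ res.execution_time }},
            CURRENT_TIMESTAMP
        )
      {% endset %}
      {% do run_query(insert_sql) %}
    {% endfor %}
  {% endif %}
{% endmacro %}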

Next, you probably have a fancy BI tool plugged into your data warehouse, so let’s use that. You should be able to answer the following questions easily:

  • How fast are my jobs?
  • What models consumed the most time over the last X runs?
  • Did my code change improve or worsen performance?
  • What are the most recent errors that caused models to fail?
  • At any given time, what is the maximum age of the data in my data warehouse?

It should be easy to see how this data can be used by you, and your data analysts, and even by your consumers.

Where is the limit?

Table transformations are the slowest possible model you can have. But they are a good baseline: their performance scales with the size of the dataset, not the size of the new data. At the other end of the scale is the “perfect” incremental model. This model has the perfect query plan, has every index it requires, and only works with data that has changed. Its performance scales only with the volume of new data. This is what you want to work towards, though it can be hard to tell how close you are to it. There will come a point where you start writing increasingly creative SQL to try to squeeze out a few more seconds of performance. Don’t go insane trying to make it faster :stuck_out_tongue:

Tests are an interesting case, as they will never positively affect the speed of your runs. I feel they still have value as a way to sanity-check that your models contain good quality data. Even then, don’t test things that are already tested. For example, don’t test that a primary key is unique on a model if you have already added a unique index as a post-hook.
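For example (Postgres-flavoured syntax, and the index and column names are made up), the index can live as a post-hook on the model itself, which makes a separate uniqueness test redundant:

-- illustrative: enforce uniqueness with an index rather than re-testing it

{{ config(
    materialized='incremental',
    post_hook="CREATE UNIQUE INDEX IF NOT EXISTS idx_this_is_fun_pk ON {{ this }} (fact_id)"
) }}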

Notes on scheduling

Not much to say here; whatever method you were using for a daily batch would probably work for hourly or more frequent batches as well. One word of warning though: be very careful to ensure that you never run multiple dbt jobs at the same time. I’ve had it happen before, and the results are not pretty.


Can you expand on ‘never run multiple dbt jobs at the same time’, please?