For those that batch hourly or more frequently, what problems have you run into and how did you overcome them?
Curious to learn about best practices and how far people have pushed batching with dbt.
Per Ben Edwards' excellent advice, you can make a custom dbt materialization for minimizing boilerplate. Let's leave out implementation details and instead talk about the concept, the inputs, and the output.
If you can maintain a clear, monotonically increasing primary key column with a consistent column name (let's call it primary_key for the moment), you can embed strong assumptions as code in a custom materialization. If your compiled SQL should look like
[CREATE TABLE | INSERT INTO] {{incrementally_updated_materialized_view}} AS
SELECT *
FROM {{compiled sql select statement}}
WHERE {{primary_key}} > (SELECT MAX({{primary_key}}) FROM {{incrementally_updated_materialized_view}})
then you can pretty easily template the whole thing as a custom materialization, where all you need to do is specify which of the columns is the primary key as an input to the materialization. If we call the primary key incremental_key, the incrementally updated model this_is_fun, and the source table fun.facts, then your uncompiled SQL might look something like
-- /path/to/models/this_is_fun.sql
{{ config(materialized='simple_incremental', incremental_key='created_ts_utc') }}
SELECT created_ts_utc, foo, bar, baz
FROM {{ source('fun', 'facts') }}
which would become
INSERT INTO {{target}}.this_is_fun
SELECT created_ts_utc, foo, bar, baz
FROM fun.facts
WHERE fun.facts.created_ts_utc > (SELECT MAX(created_ts_utc) FROM {{target}}.this_is_fun)
and if the table didn't exist it would run a create statement. All the standard incremental macro stuff can be abstracted away if you're willing to accept constraints, and you can extend it as needed to handle incrementally updating tables with multiple high-water marks.
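For reference, here is a rough sketch of what such a simple_incremental materialization could look like. Treat it as an outline under assumptions rather than a drop-in implementation: the macros used (load_relation, create_table_as, run_hooks, statement) are dbt's standard adapter macros, but the exact structure a materialization needs varies between dbt versions and adapters.

-- macros/simple_incremental.sql (a sketch, not a tested implementation)
{% materialization simple_incremental, default %}

    {%- set target_relation = this.incorporate(type='table') -%}
    {%- set existing_relation = load_relation(this) -%}
    {%- set incremental_key = config.require('incremental_key') -%}

    {{ run_hooks(pre_hooks) }}

    {% if existing_relation is none %}
        {# First run: create the table from the model's select statement #}
        {% set build_sql = create_table_as(False, target_relation, sql) %}
    {% else %}
        {# Later runs: insert only rows above the current high-water mark #}
        {% set build_sql %}
            insert into {{ target_relation }}
            select * from ( {{ sql }} ) as new_rows
            where {{ incremental_key }} >
                (select max({{ incremental_key }}) from {{ target_relation }})
        {% endset %}
    {% endif %}

    {% call statement('main') %}
        {{ build_sql }}
    {% endcall %}

    {{ run_hooks(post_hooks) }}

    {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}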
Here is a simple one: have the job time out if it executes for longer than you think it should. This will prevent cascading failures from queued jobs if a single query hangs. dbt Cloud lets you configure this out of the box.
Ok, finally found time to write this.
It is very possible to write super efficient dbt projects. I would go as far as to say that while dbt's first design goal is to make your engineering workflow easier, its second is to make it faster to run. Recently my team scaled out our dbt-based data pipeline to work across 90 pilot tenants concurrently, executing every 15 minutes. I'll let you do the math on how many executions that is. Thanks to the tooling that dbt provides, we were able to get the average execution time down to under 2 minutes.
This is the result of months of dev, and there are quite a few things we learned. Most important is to realise the tradeoff you are making. Beyond a certain point, performance improvements come at the cost of increased complexity. Complexity for your data analysts/engineers maintaining your models, complexity for your ops team maintaining your batching process. This complexity needs to be managed and mitigated to keep your code/pipeline maintainable. So here are some tips for speeding up your code, and suggestions for keeping your code maintainable.
Fewer models to run is always going to be faster. If your EL tool has the ability to tell you which models have new data since the last time you started a batch, you can use that to build up a list of models to pass into dbt with dbt run --models xxx+ yyy+ zzz+. How effective this is depends on your sources and how frequently they update; it also depends on how you are combining these sources into models.
Placing views in front of your source tables is also very useful (and can help with basic data cleaning). Views for models are also good, but be wary of the potential performance impact of views being used in downstream models.
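To make that concrete, a thin view over the fun.facts table from the earlier example might look something like this (assuming you have declared a source named fun with a table facts in a sources .yml file; the cleaning expressions are just placeholders):

-- models/staging/stg_fun_facts.sql (hypothetical model name)
{{ config(materialized='view') }}

select
    created_ts_utc,
    -- basic cleaning can live here instead of in every downstream model
    trim(foo)       as foo,
    lower(bar)      as bar,
    nullif(baz, '') as baz
from {{ source('fun', 'facts') }}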
Calling dbt with a filter on models sounds easy enough, and it has the benefit of not impacting your data analyst's workflow in any way. But it does introduce some larger concerns for your overall pipeline: in particular, you need a way to know when models get missed by your selection, because you will need to FULL_REFRESH those models when they do. None of these concerns can be (or should be) solved by anything dbt offers. And some of these should be done regardless of what frequency you run batches, as it is just good sense for automation.
materialized="incremental"
is your friend. Incremental is the tool dbt offers to make fast materializations possible. Embrace its power, but you must also understand it. You should be very comfortable with how incrementals work, how your models are executed differently on full refreshes vs incremental runs, and know how you are handling data deletion if that is a thing in your system (in this age of GDPR, itâs hard for deletion to not be required). Having said that, try to replace all table materializations with incrementals.
Incrementals add complexity to your dbt project, as suddenly all your models require branching logic that changes the final sql. It won't take you long to realise that this code is very similar in almost all your models. It becomes boilerplate, code you can't live without, mostly copy/paste, but disastrous if it has a mistake in it. At this point you should try to make the most of conventions in your code.
Incremental updates work by tracking some column in the referenced models and seeing if it is newer than the newest value in the existing model. These versioning columns should be consistent throughout your project. Same name, same datatype, same behaviour. Doing this means that your incremental logic will always look the same, and it allows you to macroify some common patterns in your models.
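For example, a typical incremental model using a consistent created_ts_utc versioning column (reusing the hypothetical fun.facts source from earlier in the thread) tends to look like this:

-- models/this_is_fun.sql
{{ config(materialized='incremental') }}

select created_ts_utc, foo, bar, baz
from {{ source('fun', 'facts') }}

{% if is_incremental() %}
  -- on incremental runs, only pull rows newer than the current high-water mark
  where created_ts_utc > (select max(created_ts_utc) from {{ this }})
{% endif %}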
If most models have the same pattern for incremental logic, then this should be your next target for conventions. Try writing an incremental model that generates the incremental conditions for you. @drop_database_public has gone into how this can be done above, and you can look here for an example that I use.
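As a rough sketch of that idea (the macro name incremental_filter and its default column are invented here), a small macro can generate the condition so models only need a one-line call:

-- macros/incremental_filter.sql (hypothetical macro name)
{% macro incremental_filter(column='updated_ts_utc') %}
    {% if is_incremental() %}
        where {{ column }} > (select max({{ column }}) from {{ this }})
    {% endif %}
{% endmacro %}

A model would then end with {{ incremental_filter('created_ts_utc') }} in place of the hand-written is_incremental() block.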
These conventions won't work for all scenarios (remember the 80/20 rule?). And the regular incremental model is still there if you need it. But you should be manually implementing materialization logic by exception, not by convention.
Ephemeral models exist to allow you to split up complex models into easy-to-understand chunks. When going for fast batches though, you should consider replacing them with work tables that are materialized incrementally. This can be a big improvement to the incremental performance of complicated models. Create as many work models as make sense for your model. This all comes at the cost of space; hopefully that isn't a huge concern.
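A sketch of the idea, with invented model names: the intermediate chunk becomes its own incrementally materialized work model, and the final model refs it instead of an ephemeral CTE.

-- models/work/work_fun_daily.sql (previously an ephemeral model)
{{ config(materialized='incremental') }}

select created_ts_utc, foo, count(*) as fact_count
from {{ source('fun', 'facts') }}
{% if is_incremental() %}
  where created_ts_utc > (select max(created_ts_utc) from {{ this }})
{% endif %}
group by created_ts_utc, foo

Downstream, this_is_fun would select from {{ ref('work_fun_daily') }} rather than repeating that logic inline.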
dbt has a wealth of data about its runs that it doesn't expose via the command line. Even when it does, you want a way to measure across all your runs. Odds are, if you are reading this, you have a data warehouse, so let's use it. Use dbt's start and end hooks to record when a batch started and finished, and with what parameters. dbt has a very detailed result model; dump all of it to your data warehouse.
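One way to do this is an on-run-end hook in dbt_project.yml, e.g. on-run-end: "{{ log_run_results(results) }}", backed by a macro along these lines. This is only a sketch: the macro and table names are invented, and the attributes available on each result (status, execution_time, etc.) differ between dbt versions, so check what your version exposes.

-- macros/log_run_results.sql (hypothetical; assumes a dbt_run_results table already exists)
{% macro log_run_results(results) %}
    {% if execute %}
        {% for result in results %}
            {% set insert_sql %}
                insert into {{ target.schema }}.dbt_run_results
                    (invocation_id, model_name, status, execution_time, logged_at)
                values (
                    '{{ invocation_id }}',
                    '{{ result.node.name }}',
                    '{{ result.status }}',
                    {{ result.execution_time }},
                    current_timestamp
                )
            {% endset %}
            {% do run_query(insert_sql) %}
        {% endfor %}
    {% endif %}
{% endmacro %}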
Next, you probably have a fancy BI tool plugged into your data warehouse, so let's use that. You should be able to answer the following questions easily:
It should be easy to see how this data can be used by you, and your data analysts, and even by your consumers.
Table transformations are the slowest possible model you can have. But they are a good baseline: their performance scales with the size of the dataset, not the size of the new data. At the other end of the scale there is the "perfect" incremental model. This model has the perfect query plan, has every index it requires, and only works with data that has changed. The performance of this model scales only with the volume of new data. This is what you want to work towards, though it can be hard to tell how close you are to it. But there will come a point where you start writing increasingly creative sql to try to squeeze out a few seconds more performance. Don't go insane trying to make it faster.
Tests are an interesting use case, as they will never positively affect the speed of your runs. I feel they still have value as a way to sanity check that your models contain good quality data. Even then, don't test things that are already tested. For example, don't test that a primary key is unique on a model if you have already added a unique index as a post hook.
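For example, if a model already enforces uniqueness with a post hook like the sketch below (Postgres-style index syntax; many cloud warehouses don't enforce uniqueness at all, in which case the test still earns its keep), a unique test on the same column is redundant:

-- models/this_is_fun.sql
{{ config(
    materialized='incremental',
    post_hook="create unique index if not exists this_is_fun_pk on {{ this }} (created_ts_utc)"
) }}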
Not much to say here; whatever method you were using for a daily batch would probably work for hourly or less as well. One word of warning though: be very careful to ensure that you never run multiple dbt jobs at the same time. I've had it happen before, and the results are not pretty.
Can you expand on "never run multiple dbt jobs at the same time"? Please?