Understanding idempotent data transformations

dbt is built on the assumption that your data transformations are idempotent – but what does that actually mean? And why was that design decision made?

What is an idempotent data transformation?

In computing, an operation is idempotent when calling it multiple times does not have a different result to calling the operation once. A real-life example of idempotence is a crosswalk-button — pressing the button once, or a dozen times, has the same effect.

Applying the concept of idempotence to data transformations can be a little difficult, since the inputs (in this case, your source data) are always changing. In data transformations, idempotence is demonstrated in the following scenarios:

  • If your source data were to stop updating, successive runs of your transformations would still result in the same tables and views in your warehouse.
  • If your production deployment of your transformations were interrupted, the next run of the transformations would result in the same tables and views as if the deployment had not been interrupted.
  • If you manually triggered transformations between scheduled runs, the scheduled run would result in the same tables and views as if the manual runs had not been triggered.
  • If the objects that data transformations builds were dropped, you could rebuild them from scratch.

How do other tools approach idempotence?

For those that have only used dbt, it may seem strange that some tools do not perform idempotent transformations, so it’s worth diving in here.

A pattern in an ETL tool might be:

  1. Write a query to find the revenue for each order that was placed in the last 24 hours (with 24 hours hardcoded in)
  2. INSERT the query results into a target table, fct_orders.
  3. Run this transformation every day to capture daily revenue

:point_up: This is a little exaggerated to get the point across, but it’s not out of the realm of possibility (I’ve seen some bad ETL code out there!). You can see in this example that the transformation is non-idempotent – the results in fct_orders are very much tied to the number of times the transformation is executed. If you miss a day, well, you won’t have that data, and if you run this twice in a day, you’ll have duplicated records. Eek!

In comparison, the idempotent version of this would be to recreate the table on each run, or write logic to incrementally process only the new records since the last run if the table already exists.

Scenarios when idempotence comes in handy

If you write idempotent transformations, you’re going to thank yourself in the following scenarios:

  • An ETL run fails: The nature of ETL jobs is that they’ll fail from time to time, resulting in missed runs. If the results of your transformations depends on the timing of your runs, a run failure becomes much more problematic. In comparison, idempotent transformations will effectively pick up where they left off, so the next time you run them, it will be as though the failure never happened. Magic! :mage:

  • Your business logic changes: Transformed data often includes both cleansing and business logic. Business logic can change over time (or may have had initial mistakes) – by ensuring transformations are idempotent, you can simply update your model logic and rebuild it! In comparison, if you’re running non-idempotent transformation, you might have to do a “backfill” to refresh this data.

  • You want to refresh your prod schema(s), or rebuild models from scratch: From time to time, it can be a good idea to drop your production schema(s) and rebuild your transformations. This can help clear out any lingering models that are no longer part of your transformation logic. Similarly, handling late-arriving data can be difficult, and sometimes it’s just easier to fully rebuild a model (--full-refresh in dbt parlance) to handle any edge cases.

  • You want to create dev environments that mirror production: If the results of your transformations depend on the timing of the runs, it really limits your ability to set up a development environment that mirrors production. As a result, building in dev becomes harder – you can’t do cool things like point your BI tool to the dev environment while to check that a dashboard makes sense, because that fct_orders table is going to have far fewer records than the production version.

  • You accidentally drop your prod schema: Look, we’ve all been there. If you feel good about your ability to rebuild your prod schema, you can be less worried about this!

How does dbt approach idempotence?

In dbt, the assumption of idempotence is really baked in – this is why models are defined as SELECT statements, while dbt handles wrapping the query in the appropriate data definition language (DDL) to create or update the relations in your table (read more here). You don’t need to write CREATE , INSERT , MERGE or statements when using dbt. If you try to, you’ll likely feel like you’re swimming upstream.

12 Likes

Idempotence seems like a very abstract mathematical concept until you spend a while dealing with all the pain that non-idempotent ETL creates. If you are never sure if running your code and refreshing the data from scratch would actually yield the same results as doing the additional incremental load that you have written, than you are dealing with the pain that non-idempotence produces. Also, a scenario that I have run into in multiple situations is

  1. Having a database with a bunch of tables/views/stored procedures/etc. in it
  2. Having a source code repository with a copy of the database objects in it
  3. Having no idea if the objects in the repository actually match the objects in the database and/or even if they do now having no system to redeploy or check in the future (a developer could make a change to the database at any time)

Having a tool like dbt which does idempotent deploys of database objects (we like putting functions, stored procs, macros, anything we might need in our database) and doing it directly from our source code repo via dbt cloud entirely solve this problem.

2 Likes