dbt is built on the assumption that your data transformations are idempotent – but what does that actually mean? And why was that design decision made?
What is an idempotent data transformation?
In computing, an operation is idempotent when calling it multiple times does not have a different result to calling the operation once. A real-life example of idempotence is a crosswalk-button — pressing the button once, or a dozen times, has the same effect.
Applying the concept of idempotence to data transformations can be a little difficult, since the inputs (in this case, your source data) are always changing. In data transformations, idempotence is demonstrated in the following scenarios:
- If your source data were to stop updating, successive runs of your transformations would still result in the same tables and views in your warehouse.
- If your production deployment of your transformations were interrupted, the next run of the transformations would result in the same tables and views as if the deployment had not been interrupted.
- If you manually triggered transformations between scheduled runs, the scheduled run would result in the same tables and views as if the manual runs had not been triggered.
- If the objects that data transformations builds were dropped, you could rebuild them from scratch.
How do other tools approach idempotence?
For those that have only used dbt, it may seem strange that some tools do not perform idempotent transformations, so it’s worth diving in here.
A pattern in an ETL tool might be:
- Write a query to find the revenue for each order that was placed in the last 24 hours (with 24 hours hardcoded in)
-
INSERT
the query results into a target table,fct_orders
. - Run this transformation every day to capture daily revenue
This is a little exaggerated to get the point across, but it’s not out of the realm of possibility (I’ve seen some bad ETL code out there!). You can see in this example that the transformation is non-idempotent – the results in fct_orders
are very much tied to the number of times the transformation is executed. If you miss a day, well, you won’t have that data, and if you run this twice in a day, you’ll have duplicated records. Eek!
In comparison, the idempotent version of this would be to recreate the table on each run, or write logic to incrementally process only the new records since the last run if the table already exists.
Scenarios when idempotence comes in handy
If you write idempotent transformations, you’re going to thank yourself in the following scenarios:
-
An ETL run fails: The nature of ETL jobs is that they’ll fail from time to time, resulting in missed runs. If the results of your transformations depends on the timing of your runs, a run failure becomes much more problematic. In comparison, idempotent transformations will effectively pick up where they left off, so the next time you run them, it will be as though the failure never happened. Magic!
-
Your business logic changes: Transformed data often includes both cleansing and business logic. Business logic can change over time (or may have had initial mistakes) – by ensuring transformations are idempotent, you can simply update your model logic and rebuild it! In comparison, if you’re running non-idempotent transformation, you might have to do a “backfill” to refresh this data.
-
You want to refresh your prod schema(s), or rebuild models from scratch: From time to time, it can be a good idea to drop your production schema(s) and rebuild your transformations. This can help clear out any lingering models that are no longer part of your transformation logic. Similarly, handling late-arriving data can be difficult, and sometimes it’s just easier to fully rebuild a model (
--full-refresh
in dbt parlance) to handle any edge cases. -
You want to create dev environments that mirror production: If the results of your transformations depend on the timing of the runs, it really limits your ability to set up a development environment that mirrors production. As a result, building in dev becomes harder – you can’t do cool things like point your BI tool to the dev environment while to check that a dashboard makes sense, because that
fct_orders
table is going to have far fewer records than the production version. -
You accidentally drop your prod schema: Look, we’ve all been there. If you feel good about your ability to rebuild your prod schema, you can be less worried about this!
How does dbt approach idempotence?
In dbt, the assumption of idempotence is really baked in – this is why models are defined as SELECT
statements, while dbt handles wrapping the query in the appropriate data definition language (DDL) to create or update the relations in your table (read more here). You don’t need to write CREATE
, INSERT
, MERGE
or statements when using dbt. If you try to, you’ll likely feel like you’re swimming upstream.