Production rollbacks

Hi community!

I’ve been doing some thinking on how to do “production rollbacks”. I’ll define a production rollback as an operation that can be executed to quickly restore a data warehouse to a known-good state.

(Think of a scenario when some data goes out to production, happens to be wrong, someone notices, and the operations team wants to restore the data to a good state quickly, so they can debug the issue in dev and QA environments.)

I was speaking with a dbt user today and we came up with this algorithm for dbt Cloud and Snowflake, using the Snowflake Time Travel feature:

  1. In the on-run-start hook of a dbt Cloud production job, record the latest query ID or timestamp in a Snowflake table called rollback.
  2. Execute the dbt Cloud production job.
  3. Observe an issue. :boom:
  4. Execute a dbt Cloud rollback job which, in a run-operation:
    a. Reads the latest entry in the rollback table.
    b. Clones the historical schema or database of the previous production data, as of that entry.
    c. Swaps the cloned schema or database into production (see the sketch after this list).
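
To make this concrete, here’s a rough, untested sketch of what the two pieces could look like as dbt macros. All of the specifics are assumptions on my part: the rollback table living at analytics.meta.rollback, production being the analytics.prod schema, and the macro names record_rollback_marker and rollback_production are made up. The first macro would be wired up via on-run-start in dbt_project.yml; the second is invoked with dbt run-operation.

```sql
-- Hypothetical macros; wire the first one up in dbt_project.yml with:
--   on-run-start: ["{{ record_rollback_marker() }}"]

-- macros/record_rollback_marker.sql
{% macro record_rollback_marker() %}
  {% if execute %}
    {# Record a Time Travel marker just before the production run builds anything #}
    {% do run_query("create table if not exists analytics.meta.rollback (marker_ts timestamp_ltz, invocation_id string)") %}
    {% do run_query("insert into analytics.meta.rollback select current_timestamp(), '" ~ invocation_id ~ "'") %}
  {% endif %}
{% endmacro %}

-- macros/rollback_production.sql
-- Invoke with: dbt run-operation rollback_production
{% macro rollback_production() %}
  {# a. Read the latest marker from the rollback table #}
  {% set marker = run_query("select max(marker_ts) from analytics.meta.rollback").columns[0].values()[0] %}

  {# b. Zero-copy clone production as of that marker, via Snowflake Time Travel #}
  {% do run_query(
      "create or replace schema analytics.prod_restored clone analytics.prod "
      ~ "at (timestamp => '" ~ marker ~ "'::timestamp_ltz)"
  ) %}

  {# c. Swap the restored schema into production; the bad data lands in #}
  {#    analytics.prod_restored, where it stays available for debugging  #}
  {% do run_query("alter schema analytics.prod swap with analytics.prod_restored") %}
{% endmacro %}
```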

I haven’t tried this out myself yet, but I’m itching to.

Curious to hear: does anybody else have an algorithm they use?

Alright, I was noodling on this with an esteemed colleague at dbt Labs and they think the following algorithm is a winning strategy for Snowflake:

create a production job which:

  1. clones the production database/schema into a new schema (run-operation)
  2. runs dbt against the cloned schema
  3. swaps the new schema with the old production schema (run-operation; see the sketch after this list)
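
Sketching that out too, under the same caveats: the macro names (clone_prod_to_staging, swap_staging_into_prod) and the analytics.prod / analytics.prod_staging schemas are all made up, and step 2’s dbt run would need a target or environment configured to build into the staging schema.

```sql
-- macros/clone_prod_to_staging.sql
-- Step 1: dbt run-operation clone_prod_to_staging
{% macro clone_prod_to_staging() %}
  {# Zero-copy clone of current production as the starting point for the run #}
  {% do run_query("create or replace schema analytics.prod_staging clone analytics.prod") %}
{% endmacro %}

-- macros/swap_staging_into_prod.sql
-- Step 3: dbt run-operation swap_staging_into_prod
{% macro swap_staging_into_prod() %}
  {# Promote the freshly built schema in one atomic rename; the previous  #}
  {# production data ends up in analytics.prod_staging, ready to swap back #}
  {% do run_query("alter schema analytics.prod swap with analytics.prod_staging") %}
{% endmacro %}
```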

If anything goes awry, they can run another job to swap the old and new databases/schemas and effectively perform a rollback.
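
A nice property of that design (assuming the hypothetical macros sketched above): the rollback path needs no Time Travel at all. The previous production data is still sitting in the swapped-out schema, so the rollback job is just the same swap run-operation executed a second time, e.g. dbt run-operation swap_staging_into_prod.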

@boxysean If I’m understanding these workflows correctly, your original proposal is more of a table / view level approach, and the second proposal is a schema / database level swap. Feels like the second approach is more scalable, given how dbt jobs are designed to populate many tables in a schema / database in a single run. Do you have any additional thoughts?