Production rollbacks

Hi community!

I am doing some thinking on how to do “production rollbacks”. I’ll define a production rollback as an operation that can be executed to restore a data warehouse to a known correct state quickly.

(Think of a scenario when some data goes out to production, happens to be wrong, someone notices, and the operations team wants to restore the data to a good state quickly, so they can debug the issue in dev and QA environments.)

Some other discussions I’ve found in the community on this topic:

I was speaking with a dbt user today and we came up with this algorithm with dbt Cloud and Snowflake using the Snowflake Time Travel feature:

  1. In the on-run-start hook of a dbt Cloud production job, record the latest query ID or timestamp in a Snowflake table called rollback.
  2. Execute the dbt Cloud production job.
  3. Observe an issue. :boom:
  4. Execute a dbt Cloud rollback job, that, in a run-operation:
    a. Reads the latest entry in the rollback table.
    b. Clones the historical schema or database (link) of the previous production data.
    c. Swaps the cloned schema or database into production.

I haven’t tried this out myself yet, but itching to try it out.

Curious to hear if anybody else has algorithms they use?