Best Practices for Robust, Idempotent Incremental Models (Beyond Timestamps)

Hello dbt Community!

I’m working on designing a near real-time pipeline and want to ensure our dbt incremental models are as robust as possible.

My background is in stream processing with tools like Spark Streaming, where you rely on stateful checkpoints to guarantee exactly-once processing. Looking at dbt’s incremental models, the most common pattern I see is filtering the source for rows with updated_at greater than the current max(updated_at) in the target table.
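
For concreteness, this is the shape of the pattern I mean (a minimal sketch; the model, source, and column names are just placeholders):

```sql
-- A typical timestamp-filtered incremental model (hypothetical names)
{{
    config(
        materialized='incremental'
    )
}}

select
    event_id,
    payload,
    updated_at
from {{ source('app', 'events') }}

{% if is_incremental() %}
  -- only pull rows newer than the newest row already in the target table
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```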

My concern with this approach is its potential fragility:

  1. Failure Recovery & Missed Windows: If my orchestrator (e.g., Airflow) runs a dbt job every 5 minutes and the 10:05 AM run fails, the next successful run at 10:10 AM will filter for timestamp > max(timestamp) as it stood before the failed run. That catches up on the missed window only if every row that arrived in the meantime carries a timestamp above that max; anything that lands with an earlier timestamp (late-arriving or out-of-order data) is silently skipped. How does the community architect its jobs to be resilient to these orchestrator failures and prevent data gaps?

  2. Duplicate Data with an append Strategy: My second concern is about re-runs. If I need to manually re-run a job for a specific historical period (for example by overriding the incremental filter with a variable), a simple timestamp filter will re-select all the data from that period. If the model uses incremental_strategy = 'append' (which is common for immutable event logs or tables without a reliable primary key), the re-run will create duplicates. What are the best practices for safely re-running append-only incremental models?
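
To make concern 2 concrete, here is a minimal sketch of the kind of append-only model I have in mind (again, all names are placeholders, and the backfill_start var is just one way I could imagine triggering a historical re-run):

```sql
-- Append-only incremental model (hypothetical names); backfill_start is an
-- assumed var used to manually re-run a historical window
{{
    config(
        materialized='incremental',
        incremental_strategy='append'
    )
}}

select
    event_id,
    event_type,
    occurred_at
from {{ source('app', 'raw_events') }}

{% if is_incremental() %}
  {% if var('backfill_start', none) is not none %}
  -- manual backfill, e.g. dbt run --vars 'backfill_start: 2024-01-01':
  -- these rows are selected again and, with append, inserted a second time
  where occurred_at >= '{{ var("backfill_start") }}'
  {% else %}
  where occurred_at > (select max(occurred_at) from {{ this }})
  {% endif %}
{% endif %}
```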

Beyond using a timestamp, what are the community’s recommended patterns for building truly idempotent and reliable incremental models?

I’m particularly interested in:

  1. The pros and cons of relying solely on a unique_key with a MERGE strategy (first sketch below). How do you handle performance on very large source tables?

  2. Patterns for using CDC data with a Log Sequence Number (LSN) or version number instead of a timestamp (second sketch below).

  3. Are there any established best practices for a “manual checkpoint” system, for example, using a control table to manage state (third sketch below)?
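
For item 1, this is the shape of the merge-based config I am asking about (a sketch with placeholder names):

```sql
-- Merge-based incremental model keyed on order_id (hypothetical names)
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id'
    )
}}

select
    order_id,
    customer_id,
    status,
    updated_at
from {{ source('app', 'orders') }}

{% if is_incremental() %}
  -- re-selecting already-loaded rows is safe here: the merge updates them
  -- in place instead of duplicating them, at the cost of a potentially
  -- expensive scan of the target table
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```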
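
For item 2, I am picturing something along these lines, where the CDC tool exposes a monotonically increasing log sequence number (the _lsn column name is an assumption on my part):

```sql
-- CDC-fed incremental model filtered on an LSN instead of a timestamp
-- (hypothetical names; assumes the CDC tool emits an _lsn column)
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id'
    )
}}

select
    order_id,
    status,
    _lsn
from {{ source('cdc', 'orders') }}

{% if is_incremental() %}
  -- an LSN is monotonic per source, so it avoids wall-clock issues like
  -- late-arriving rows with older timestamps
  where _lsn > (select max(_lsn) from {{ this }})
{% endif %}
```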
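
And for item 3, the rough idea I have in mind is a small control table holding a high-water mark per model: the incremental filter reads it, and a post-hook advances it only after a successful build. This is only a concept sketch; the analytics.etl_checkpoints table, its columns, and the model name are all made up, and the table would need to be created and seeded outside of the model:

```sql
-- Incremental model driven by a manual checkpoint / control table
{{
    config(
        materialized='incremental',
        post_hook="update analytics.etl_checkpoints set high_water_mark = (select max(updated_at) from {{ this }}) where model_name = 'events_checkpointed'"
    )
}}

select
    event_id,
    payload,
    updated_at
from {{ source('app', 'events') }}

{% if is_incremental() %}
  -- read the last successfully committed high-water mark rather than
  -- computing max() over the target table
  where updated_at > (
      select high_water_mark
      from analytics.etl_checkpoints
      where model_name = 'events_checkpointed'
  )
{% endif %}
```

The appeal to me is that the watermark only advances when the whole run succeeds, which feels closer to the streaming checkpoints I am used to.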

Thanks for sharing your insights!