Handling Intermittent Test Failures in dbt: A Case Study
Hey dbt community!
We’ve been experiencing an interesting issue with our dbt implementation that I wanted to share, hoping to get some insights or learn if others have faced similar situations.
The Issue:
- We have intermittent failures in our dbt runs, specifically with unique tests
- The failures occur randomly in automated runs
- Running a manual full-refresh always fixes the issue
- The pipeline otherwise runs fine on its own, but sometimes breaks with no code or config changes in between
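For context, our setup looks roughly like this (model and column names are made up for illustration): an incremental model with in-batch deduplication, plus a `unique` test on the key.

```sql
-- models/fct_orders.sql (illustrative; names are hypothetical)
{{ config(
    materialized = 'incremental',
    unique_key = 'order_id'
) }}

select *
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- only pick up rows newer than what's already in the target table
  where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}

-- deduplicate within the batch, keeping the latest record per key
qualify row_number() over (
    partition by order_id
    order by loaded_at desc
) = 1
```

with the corresponding test in `schema.yml`:

```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
```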
What we’ve investigated:
- Data analysis:
  - Found duplicate records in BigQuery that shouldn’t exist according to our transformation logic
  - The duplicates appear in a table whose model already includes deduplication logic
- Possible root causes we’ve read about:
  - Timing/race conditions
  - Data arriving in a different order between runs
  - Concurrency issues in data loading
  - Cached/temporary data not being cleaned up properly
  - Inconsistencies in the source data
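One diagnostic that can help distinguish between these causes is to profile the duplicates before the next full-refresh wipes them (table and column names below are illustrative, matching the hypothetical model above). If each duplicated key has identical load timestamps, the in-batch dedup logic has a gap; if the timestamps differ, the copies arrived in separate incremental batches, which points at the incremental boundary filter or at concurrent runs racing each other.

```sql
-- Hypothetical diagnostic query against the affected BigQuery table
select
  order_id,
  count(*) as copies,
  min(loaded_at) as first_seen,
  max(loaded_at) as last_seen,
  count(distinct loaded_at) as distinct_load_times
from `project.dataset.fct_orders`
group by order_id
having count(*) > 1
order by copies desc
limit 100
```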
Current Workaround:
- Manual full-refresh resolves the issue temporarily
- We’ve tried automating the full-refresh, but the tests eventually fail anyway; a manually triggered full-refresh, however, always works
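Since the manual full-refresh works but the automated one eventually fails, one thing we’re considering is whether the `--full-refresh` flag is actually reaching dbt in the scheduled job. As a stopgap, dbt also lets you force the behavior per model via the `full_refresh` config, which sidesteps the scheduler entirely (sketch below, using the same hypothetical model; note this gives up the cost savings of incremental builds, so it’s a workaround while hunting the root cause, not a fix):

```sql
-- models/fct_orders.sql (illustrative): rebuild from scratch on every run,
-- mirroring what the manual full-refresh does
{{ config(
    materialized = 'incremental',
    unique_key = 'order_id',
    full_refresh = true
) }}
```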
Questions for the Community:
- Has anyone experienced similar intermittent unique test failures?
- What strategies have you implemented to handle race conditions in dbt?
- Are there best practices for managing cache and temporary data in dbt?
Would love to hear your thoughts and experiences!