Hey all, I published a comparison of three approaches to customer identity resolution, all run against the same dataset of 6,500 records:
- Pure dbt SQL (5 models + macros, runs on DuckDB) - 45% merge rate
- Splink + DuckDB - 66% merge rate
- Kanoniv (YAML spec, Rust engine) - 63% merge rate
The dbt SQL version uses dbt seed to load CSVs, Jaro-Winkler scoring, and iterative label propagation for clustering. It’s a good reference for anyone who wants to see what hand-rolled identity resolution looks like in pure SQL.
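For anyone unfamiliar with the two techniques named above, here is a rough Python sketch of the idea. It is not the repo's code: the dbt version does this in SQL, and the function names, the 0.9 threshold, and the sample records are all made up for illustration. It scores every pair of names with Jaro-Winkler, keeps pairs above the threshold as edges, then repeatedly pushes the smallest label across each edge until nothing changes.

```python
from itertools import combinations


def jaro(a, b):
    """Jaro similarity between two strings, in [0, 1]."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(la, lb) // 2 - 1, 0)
    a_hit, b_hit = [False] * la, [False] * lb
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [a[i] for i in range(la) if a_hit[i]]
    b_seq = [b[j] for j in range(lb) if b_hit[j]]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / la + matches / lb + (matches - t) / matches) / 3


def jaro_winkler(a, b, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def propagate_labels(nodes, edges):
    """Iterative min-label propagation: each node ends up labelled with the
    smallest id in its connected component."""
    labels = {n: n for n in nodes}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            if labels[a] != low or labels[b] != low:
                labels[a] = labels[b] = low
                changed = True
    return labels


def cluster(records, threshold=0.9):
    """records: {id: name}. The threshold is an illustrative value only."""
    edges = [
        (ia, ib)
        for (ia, na), (ib, nb) in combinations(records.items(), 2)
        if jaro_winkler(na, nb) >= threshold
    ]
    return propagate_labels(records.keys(), edges)


# Hypothetical sample: records 1 and 2 are near-duplicates, 3 is unrelated.
print(cluster({1: "martha", 2: "marhta", 3: "xyz"}))  # {1: 1, 2: 1, 3: 3}
```

The all-pairs comparison is quadratic, so this only works as a toy; a real pipeline would normally add blocking before scoring.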
Repo with all three + shared data:
I also maintain a dbt package (dbt-kanoniv) for teams using the cloud version.