6500 Records of Customer Identity Resolution in 170 lines of YAML code

dareesunday · February 17, 2026, 5:25pm

Hey all, I published a comparison of 3 approaches to customer identity resolution using the same dataset of 6,500 records:

Pure dbt SQL (5 models + macros, runs on DuckDB) - 45% merge rate
Splink + DuckDB - 66% merge rate
Kanoniv (YAML spec, Rust engine) - 63% merge rate

The dbt SQL version uses dbt seed to load CSVs, Jaro-Winkler scoring, and iterative label propagation for clustering. It’s a good reference for anyone who wants to see what hand-rolled identity resolution looks like in pure SQL.
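For readers who haven't seen those two pieces before, here is a rough Python sketch of the general idea: score name pairs with Jaro-Winkler, keep pairs above a threshold as edges, then propagate the smallest label along edges until nothing changes. This is not the SQL from the repo. The sample names, the 0.9 threshold, and the function names are all made up for illustration, and the min-label propagation shown is one common way to do iterative clustering, assumed here rather than confirmed as the repo's exact method.

```python
from itertools import combinations


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1 = [False] * len(s1)
    hit2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i, ch in enumerate(s1):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def propagate_labels(ids, edges):
    """Every record starts as its own cluster; each pass pulls both ends
    of an edge down to the smaller label, until a pass changes nothing."""
    labels = {i: i for i in ids}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            if labels[a] != low or labels[b] != low:
                labels[a] = labels[b] = low
                changed = True
    return labels


# Hypothetical sample records (id -> name), not taken from the shared dataset.
records = {
    1: "jonathan smith",
    2: "jonathon smith",
    3: "maria garcia",
    4: "maria garica",
    5: "wei chen",
}
THRESHOLD = 0.9  # assumed cutoff for illustration only

edges = [
    (a, b)
    for a, b in combinations(records, 2)
    if jaro_winkler(records[a], records[b]) >= THRESHOLD
]
labels = propagate_labels(records, edges)
print(labels)  # {1: 1, 2: 1, 3: 3, 4: 3, 5: 5}
```

In SQL the same loop typically becomes a repeated self-join on the edge table, rerun until the label column stops changing, which is where most of the extra model and macro code in a hand-rolled version tends to come from.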

Repo with all three + shared data:

I also maintain a dbt package (dbt-kanoniv) for teams using the cloud version.
