Avoiding collisions during MD5 surrogate key creation

Hi all,

I searched the forum but did not find a thread specific to my topic, but only closely related ones as per the following:
For surrogate key creation, dbt utils implements md5 hashing [1]
Auto incrementing has other downsides and is generally not supported [2].

My question now is: how do you handle collisions?
While you can and should definitely test for uniqueness in your keys - as you cannot enforce it - what do you do if your test fails and you face a collision? Do you add some salt or something?

Appreciate your answers!

How much data do you have?

MD5 has its flaws, but random collisions are extremely rare. On average, you would need 2^64 records before you get a single collision – that’s 18 billion billion records (1.8 * 10^19).

You could use a hash with even more entropy, like SHA2, but if you have enough data to be concerned about MD5 collisions, then the performance penalty of a different hash is probably also going to be a concern. But I suppose if you have a quintillion records in your table, you could build up an incremental model that uses SHA512, which would scale up without collisions until you have more records than there are atoms in the universe (~10^80)

we use UUIDs, and Snowflake’s UUID_STRING() specifically to generate most of our unique ids.