Are full MD5 hashes a good default for surrogate keys?

dbt’s generate_surrogate_key macro takes a MD5 hash of the input columns and returns it as a string. I would expect string keys to perform worse than 64-bit integer keys in joins due to their larger size. In my benchmark , I found that that replace the default macro with one based on MD5_NUMBER_LOWER64 in Snowflake makes joins on that surrogate key much faster, to the order of 30%.

Now this replacement surrogate key only has 64 bits of entropy, compared to 128 bits of entropy for a full MD5 hash, so key collisions are more likely for large datasets sizes. However, even at 100 million rows, the collision probability is only 0.027% , and many tables in practice are well under that size. This makes me lean towards using the more efficient version if I know the size of a table will never be a problem for collisions.

Long-form version of this argument on my blog here.