Thanks for that insightful comment. Really appreciate it!
In my Autocompeter.com service (written in Go) I actually went for version h6(). I needed it to be as small as possible, and that one very quickly produces a hash that is only 6 characters long and seems very safe against clashes.
Hi! This page is currently the 6th Google result for [python fast hash]! Unfortunately, the analysis here is somewhat confused, which could lead beginners astray.
MD5 would be better described as a flawed cryptographic hash function, rather than a "non-cryptographic hash function". (MD5 tried to be cryptographically strong, and was used in that fashion for quite a while, but turned out to be broken. A hash function actually designed to be non-cryptographic would be simpler and faster; more on that later.)
To meaningfully compare algorithms, you should do so based on their raw output (bits/bytes), not multiple varying printable representations (hex/base64, possibly truncated). Those encoding/truncation decisions are logically separate, and can be done at the end, to any suitable algorithm.
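To make the separation concrete, here's a quick stdlib sketch: one raw MD5 digest, rendered three different ways. The algorithm choice produces the bytes; hex, base64, and truncation are all downstream decisions.

```python
import base64
import hashlib

digest = hashlib.md5(b"example").digest()  # 16 raw bytes (128 bits)

hex_form = digest.hex()                                             # 32 hex characters
b64_form = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()   # 22 base64 characters
short_form = hex_form[:9]                                           # truncated to 36 bits

# Same algorithm, same raw output -- three printable representations.
print(hex_form, b64_form, short_form)
```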
Against a random set of inputs like your "130,000+" words, taking the same number of leading bits from either MD5 or SHA256 will be equally likely to be free of collisions. (The known flaws in MD5 only come into play if inputs are chosen adversarially to engineer collisions.)
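This is easy to check empirically. The sketch below (my own illustration, not from the original post) hashes tens of thousands of random strings, truncates both MD5 and SHA-256 digests to the same 48 leading bits, and counts clashes — with non-adversarial input, neither should produce any.

```python
import hashlib
import random
import string

rng = random.Random(42)
words = {"".join(rng.choices(string.ascii_lowercase, k=12)) for _ in range(50_000)}

def collisions(algorithm, nbytes):
    """Count inputs whose truncated digest clashes with an earlier input's."""
    seen = set()
    clashes = 0
    for w in words:
        d = hashlib.new(algorithm, w.encode()).digest()[:nbytes]
        if d in seen:
            clashes += 1
        seen.add(d)
    return clashes

# 6 leading bytes = 48 bits: far more than the birthday bound needs for 50,000
# random inputs, so both algorithms should come out collision-free.
print(collisions("md5", 6), collisions("sha256", 6))
```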
MD5 will be faster for the same amount of input, since it is a simpler algorithm with smaller state and output. (Your timing scores based on output characters, as modified by a further encoding/truncation choice, are a bit fishy. It isn't typical to rate a hash's speed by output characters, and your times may be dominated by other choices in your loop.)
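A more typical way to time the algorithms themselves is per-call over a fixed payload, keeping encoding out of the loop — a rough sketch below. (On many builds MD5 comes out faster, though hardware SHA extensions in newer CPUs can narrow or even reverse the gap.)

```python
import hashlib
import timeit

payload = b"x" * 4096  # fixed-size input, so we time the algorithm, not the encoding

md5_time = timeit.timeit(lambda: hashlib.md5(payload).digest(), number=20_000)
sha256_time = timeit.timeit(lambda: hashlib.sha256(payload).digest(), number=20_000)

print(f"md5:    {md5_time:.3f}s")
print(f"sha256: {sha256_time:.3f}s")
```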
Given your test – 130,000 (around 2^17) random words – there's a "birthday problem" probability calculation that predicts how many unique, randomly-distributed output values any hash function would need to offer to make even one collision among those 2^17 probes unlikely: around 2^34. So it's no surprise that each of your shortest surviving hashes encodes 36 bits, with just the right number of characters to go over 34 bits. (Hex characters, with 16 values, encode 4 bits. So 8 characters (8*4=32 bits) would be too few, while 9 characters (9*4=36 bits) is just enough. Similarly base64 characters encode 6 bits, so 5 characters (5*6=30 bits) would be too few, while 6 characters (6*6=36 bits) is just enough.)
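The birthday estimate behind those numbers fits in a few lines: with n inputs and b output bits, the expected number of colliding pairs is roughly n(n-1)/2 divided by 2^b, and we want the smallest b that keeps it below, say, 0.5.

```python
def expected_collisions(n, bits):
    """Birthday estimate: expected colliding pairs among n random b-bit values."""
    return n * (n - 1) / 2 / 2**bits

n = 130_000  # about 2**17 random words
# Smallest bit-width at which even one collision becomes unlikely:
bits_needed = next(b for b in range(1, 129) if expected_collisions(n, b) < 0.5)
print(bits_needed)  # 34 -- so 9 hex characters or 6 base64 characters (36 bits) suffice
```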
So what technique should be used to create the sort of usually-unique, stable, short IDs you want?
If you need to stick with the Python standard library, starting with MD5 is a defensible choice. Then if compactness in text representation matters, represent the hashes in base64. Then if you want to accept more collision-risk than is likely with the full 128 bits, use the birthday-problem calculations to pick a number of retained bits that fit your size/collision tradeoffs, and truncate away the rest.
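That stdlib recipe — MD5, base64, then truncate to a chosen bit budget — might look like the sketch below (the function name and the 6-character default are my own choices; 6 urlsafe-base64 characters retain 36 bits).

```python
import base64
import hashlib

def short_id(text, chars=6):
    """Stable short ID: MD5 the text, base64-encode, keep the leading characters.

    Each urlsafe-base64 character carries 6 bits, so chars=6 retains 36 bits --
    per the birthday estimate, enough for ~130,000 inputs with a collision unlikely.
    """
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest)[:chars].decode("ascii")

print(short_id("peter"))  # deterministic across runs and machines
```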
But remember MD5 is slower than hash functions that were never intended to offer cryptographic collision-resistance. So if speed is a goal, you could look outside the standard library. If you'd like to use the fast non-cryptographic hash functions that Google has published, look into their projects "CityHash" or "FarmHash". (These even have 256-bit options, on the off chance that the collision-risk with 128 bits is still too high for your application.) The first 36 (or 64 or 80 or 128) bits from these should be just as good as the bits from MD5, for your purposes. Then, apply the same encoding and truncation choices as may be appropriate.
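CityHash and FarmHash need third-party bindings, so as a self-contained illustration of the same take-leading-bits-then-encode step, here's FNV-1a — a different, very simple non-cryptographic 64-bit hash used purely as a stand-in, not something the comment above recommends:

```python
import base64

def fnv1a_64(data: bytes) -> int:
    """FNV-1a: a simple non-cryptographic 64-bit hash (stand-in for CityHash/FarmHash)."""
    h = 0xCBF29CE484222325          # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # multiply by the FNV prime, mod 2**64
    return h

def short_id(text, chars=6):
    # Serialize the 64-bit value as 8 bytes, then encode/truncate exactly as with MD5.
    raw = fnv1a_64(text.encode("utf-8")).to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw)[:chars].decode("ascii")

print(short_id("peter"))
```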