Anyway, on to today's post. And today's random topic shall be on why hashing algorithms are not unique identifiers.
Let's take the SHA-1 algorithm. It's a cryptographically secure hashing algorithm that produces a 160-bit output. Now, 2^160 is a stupidly large number: over 1.46×1048. To put that in perspective, if you multiplied that number by 100 the result is more than the number of atoms in the earth (estimated at 1.33×1050). That should be large enough to give a unique output for any input, right?
Let's say that you somehow generate every possible sequence of 20 bytes (all 2^160 of them), and feed each one in turn to SHA-1. You'd get a different 20-byte sequence output for each input - in fact, you'd get 2^160 different outputs (this does assume that SHA-1 can generate a different output for each 20-byte input - it may well not do this, in which case you've already got your collision). Now generate a 21-byte sequence, and hash that. The result will match one of your previous outputs, since there are only 2^160 possible outputs and you've already generated every single one.
Of course, doing such is completely impractical. That doesn't mean that hash collisions won't happen - they're still possible, since for every possible output there are an infinite number of inputs that will give that output. I'll grant that actually finding a collision is extremely improbable, but "extremely improbable" is not the same as "impossible" and does not mean "cannot happen tomorrow". The more inputs you try the higher the chance of a collision is as well, due to the birthday paradox - if one were able to generate 2^80 inputs (still completely impractical, but less impractical than 2^160 inputs), then the chance of a collision is 50%.
And that's why things like Mercurial and git make me slightly nervous, since that uses SHA-1 as a unique identifier and just assumes that a hash collision won't ever happen.