Hashing algorithms ≠ unique identifiers, mmkay? 
[Wednesday 2nd November 2011 at 11:56 pm]
Thomas

I see talismancer has just managed to sneak in a post for yesterday. I can't really mock him for that one, seeing as I did the same several times during last year's run (and in fact while last night's post has a timestamp of 11:59:59 it was actually submitted a few seconds later).
Anyway, on to today's post. And today's random topic shall be on why hashing algorithms are not unique identifiers.
Let's take the SHA1 algorithm. It's a cryptographically secure hashing algorithm that produces a 160bit output. Now, 2^160 is a stupidly large number: over 1.46×10^{48}. To put that in perspective, if you multiplied that number by 100 the result is more than the number of atoms in the earth (estimated at 1.33×10^{50}). That should be large enough to give a unique output for any input, right?
Wrong.
Let's say that you somehow generate every possible sequence of 20 bytes (all 2^160 of them), and feed each one in turn to SHA1. You'd get a different 20byte sequence output for each input  in fact, you'd get 2^160 different outputs (this does assume that SHA1 can generate a different output for each 20byte input  it may well not do this, in which case you've already got your collision). Now generate a 21byte sequence, and hash that. The result will match one of your previous outputs, since there are only 2^160 possible outputs and you've already generated every single one.
Of course, doing such is completely impractical. That doesn't mean that hash collisions won't happen  they're still possible, since for every possible output there are an infinite number of inputs that will give that output. I'll grant that actually finding a collision is extremely improbable, but "extremely improbable" is not the same as "impossible" and does not mean "cannot happen tomorrow". The more inputs you try the higher the chance of a collision is as well, due to the birthday paradox  if one were able to generate 2^80 inputs (still completely impractical, but less impractical than 2^160 inputs), then the chance of a collision is 50%.
And that's why things like Mercurial and git make me slightly nervous, since that uses SHA1 as a unique identifier and just assumes that a hash collision won't ever happen. 

