Log in

No account? Create an account
'Twas brillig, and the slithy toves did gyre and gimble in the wabe [entries|archive|friends|userinfo]

[ website | Beware the Jabberwock... ]
[ deviantArt | the-boggyb ]
[ FanFiction | Torkell ]
[ Tumblr | torkellr ]

[Random links| BBC news | Vulture Central | Slashdot | Dangerous Prototypes | LWN | Raspberry Pi]
[Fellow blogs| a Half Empty Glass | the Broken Cube | The Music Jungle | Please remove your feet | A letter from home]
[Other haunts| Un4seen Developments | Jazz 2 Online | EmuTalk.net | Feng's shui]

Hashing algorithms ≠ unique identifiers, mmkay? [Wednesday 2nd November 2011 at 11:56 pm]

[Tags|, , , ]
[Playing |Harry's Game ~ Ryan and Rachel O'Donnell/The Celtic Chillout Album - Disc 1]

I see talismancer has just managed to sneak in a post for yesterday. I can't really mock him for that one, seeing as I did the same several times during last year's run (and in fact while last night's post has a timestamp of 11:59:59 it was actually submitted a few seconds later).

Anyway, on to today's post. And today's random topic shall be on why hashing algorithms are not unique identifiers.

Let's take the SHA-1 algorithm. It's a cryptographically secure hashing algorithm that produces a 160-bit output. Now, 2^160 is a stupidly large number: over 1.46×1048. To put that in perspective, if you multiplied that number by 100 the result is more than the number of atoms in the earth (estimated at 1.33×1050). That should be large enough to give a unique output for any input, right?


Let's say that you somehow generate every possible sequence of 20 bytes (all 2^160 of them), and feed each one in turn to SHA-1. You'd get a different 20-byte sequence output for each input - in fact, you'd get 2^160 different outputs (this does assume that SHA-1 can generate a different output for each 20-byte input - it may well not do this, in which case you've already got your collision). Now generate a 21-byte sequence, and hash that. The result will match one of your previous outputs, since there are only 2^160 possible outputs and you've already generated every single one.

Of course, doing such is completely impractical. That doesn't mean that hash collisions won't happen - they're still possible, since for every possible output there are an infinite number of inputs that will give that output. I'll grant that actually finding a collision is extremely improbable, but "extremely improbable" is not the same as "impossible" and does not mean "cannot happen tomorrow". The more inputs you try the higher the chance of a collision is as well, due to the birthday paradox - if one were able to generate 2^80 inputs (still completely impractical, but less impractical than 2^160 inputs), then the chance of a collision is 50%.

And that's why things like Mercurial and git make me slightly nervous, since that uses SHA-1 as a unique identifier and just assumes that a hash collision won't ever happen.
Link | Previous Entry | Share | Next Entry[ 2 pennies | Penny for your thoughts? ]

[User Picture]From: delta_mike
Thursday 3rd November 2011 at 8:05 am (UTC)
The table on that Wikipedia article, "Probability Table", is interesting for assessing the probability of such a failure.

Assuming a given hash is a perfect cryptographic hash, then, it shows that if you had 2.6 × 1010 objects in a given set hashed to a 128bit string, then you'd have a collision occur with probability 10-18.

To get up to a 1:1000 chance of a collision, you'd need 8.3 × 1017 hashed objects.

As a source of possible failure, it's really damn unlikely. Designing for hash agility, and for being able to cope in this case is -desirable-, but there really does seem to be negligible risk involved in assuming uniqueness.
(Reply) (Thread)
[User Picture]From: boggyb
Thursday 3rd November 2011 at 7:23 pm (UTC)
I think part of my issue with it is the question of what does the software do? The impression I get from reading the discussion they had on the Mercurial list is that they don't care because it's very unlikely, and so should you get a collision the result is undefined behaviour.

The other issue I have with the whole head-in-the-sand approach is that it's not hard to work around this. Off the top of my head you could append a sequence number to your hash, you could have a collision chain and walk the list looking for your exact data, or you could add a dummy field to the data and modify that until you don't get a collision. Admitidly they all have limitations, but they're very deterministic limitations (like a fixed upper limit on the number of supported collisions). I think git even has a compile-time option for one of these.

Or as an alternative, one could use an identifier that's designed to be unique. A GUID is a good example (though that does assume everyone is playing by the rules and not doing naughty things like reusing MAC addresses), or you could derive it from your public IP address at which point all you need to append is a locally unique identifier.

Random aside: we have a piece of software that will handle upwards of 4000 messages per second. A quick back-of-the-envelope calculation makes that about 1.5×1011 messages per year. Run it on a million servers, and there's your 1017 :)
(Reply) (Parent) (Thread)