Hashes and probability
Christmas Cheer
Due to my family being made up generally of quiet people, we sort of all desserted the sitting rooom after the xmas meal , and went and did our own thing, well we pop back seperately at times to watch different stuff on the tv and keep Gran company. But I think most of us weere just looking for some peasce to read or whatever , and while there was some good stuff on TV there wasn't much of it. And since the TV was on for Gran - well some of us left her to it. I ending up in the home office reading up on SCM systems, as I'm got a idea presented later about a bug tracker and wanted to be sure it would work on most SCM. In hindsight I think it has to, but I'm being cautious.
Hashes and Probability
Anyway as a result I was reading the monotone documentation, Monotone is another DSCM system and looks quite cool, although shares a lot with git , as far as I can tell, but is probably slightly easier
to learn.
Anyway monotone has a section in it manual which tries (and to some extent succeds) to defend 'compare-by-hash' . Now for those of you who don't know compare by hash is the technique of usings Cryptographic hashes to todo comparisions, Rsync for one uses this to good effect , but both git and monotone use it to name file revisions. So if two files every have the same hash the system breaks down because it can't tell them apart.
The question is how likely is this ? And it obviously depends the amount of data stored - specifically the number of hashed objects. While the defenders are correct that it not very likely I'm a little woried they seem to useing the 50% threshold in there quotes, eg, the number of hashes required for a there to be a 50% chance of collision.
I think I'd prefer it it we culd standards on say the 5% and 1% level similiar to other statistcially uses.
Having said that bc suggests for SHA-1 thats still not as bad as it
sounds:- (Approximation formula from wikipedia - I couldn't find it on mathworld )
So I'm reasonably happy still with this technique
I'll blog later on SCMs and bugs and maybe more on this too.
sqrt(2*(2^160)*l(1/(1-0.05)))
387208558096282571848629.44632292079366906761
Which I make to be about 3 * 10^42 objects, and only
about half that for the 1% level.