This makes me worried about hash collisions as well. The article implies that a ...

thomaswmeyer · on April 12, 2011

SHA256, which Dropbox uses, has around 10^77 possible hashes. That's 100 trillion quadrillion quadrillion quadrillion quadrillion possible values. So I wouldn't worry about hash collisions if I were you. If a hash collision happened easily in SHA256, that would be very big news for the security community, and much more serious services than Dropbox would be affected.

ceejayoz · on April 12, 2011

Deduping with hash and filesize in bytes should make a collision unlikely enough.

gojomo · on April 12, 2011

This is a common sentiment, but not really sensible. If you want to store (for example) another 64 bits worth of information, you would always be better off with 64 more bits of some strong hash than 64 bits of filesize.

slackerIII · on April 12, 2011

I think the point is that creating a collision when data + fileSize is hashed is much harder than just creating a collision when hashing data alone. Raymond explains it well: http://blogs.msdn.com/b/oldnewthing/archive/2004/05/19/13493...

gojomo · on April 13, 2011

...and if you took the same number of bits you were using to store the filesize, and instead stored that many bits of some independent secure hash, it'd be harder still.

Of course every extra bit that has to be matched makes collisions 'harder' but length bits are much weaker than other options, except insofar as they may already be available for other reasons.

justincormack · on April 12, 2011

And that was written before the md5 collisions were discovered. And no collision has yet been discovered for md5 for files of the same length, they are all extension attacks...

gojomo · on April 13, 2011

Absolutely false. Some MD5 collision-generators specifically find pairs of equally-lengthed inputs with the same hash. See for example hit #2 for [MD5 collisions]:

http://www.mscs.dal.ca/~selinger/md5collision/

'Extension attacks' are something else, which let you turn one collision into more, or create valid hashes for combinations of unknown text plus a chosen extension – not find an initial collision. See:

http://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_con...

The 'length extension' property can be helpful, once you find a collision based on 'random' nonsense, in extending that into two documents that are each meaningful-but-different and still colliding, as was done in this 2005 MD5 collision demonstration:

http://replay.waybackmachine.org/20050612011328/http://www.c...

sukuriant · on April 12, 2011

But still possible. Whenever the number of bits in a file is more than the number of bits in a hash, there are collisions. They could use the Chinese Remainder Theorem, but that would only go so far (maybe far enough to remove substantial doubt? The link below seems to suggest so.)

Relavent to the discussion: http://stackoverflow.com/questions/622930/purposely-create-t...

schwanksta · on April 12, 2011

At first, that was what I thought the flaw would be -- providing a file that has a hash that collides with another file, gets you that file.

But it seems to me you would need to know the exact contents of the file in question to get that to happen, making the point moot. Perhaps I'm wrong on that.

avdempsey · on April 12, 2011

Do they check filesize too? What are the odds of a hash collision + identical filesize? We might need Carl Sagan to answer that one.

bdhe · on April 12, 2011

> What are the odds of a hash collision + identical filesize?

If implemented correctly, the additional constraint on filesize being the same is irrelevant. Given one particular hash value, the probability that a second file hashes to the same value is 1/(range of hash function) if the hash function is modeled as an ideal hash function.

Groxx · on April 12, 2011

Given that they do chunks of updates, rather than the whole file, presumably they check multiple hashes to ensure this doesn't happen. I could be wrong, though.