This makes me worried about hash collisions as well. The article implies that a file whose hash matches something they already have will never even reach their servers - so presumably I just have to keep my fingers crossed that the file they're synchronising to all my machines is the one I uploaded, and not some other user's completely different file that happens to have the same hash?
SHA256, which Dropbox uses, has around 10^77 possible hashes. That's 100 trillion quadrillion quadrillion quadrillion quadrillion possible values. So I wouldn't worry about hash collisions if I were you. If a hash collision happened easily in SHA256, that would be very big news for the security community, and much more serious services than Dropbox would be affected.
This is a common sentiment, but not really sensible. If you want to store (for example) another 64 bits worth of information, you would always be better off with 64 more bits of some strong hash than 64 bits of filesize.
...and if you took the same number of bits you were using to store the filesize, and instead stored that many bits of some independent secure hash, it'd be harder still.
Of course every extra bit that has to be matched makes collisions 'harder' but length bits are much weaker than other options, except insofar as they may already be available for other reasons.
And that was written before the md5 collisions were discovered. And no collision has yet been discovered for md5 for files of the same length, they are all extension attacks...
Absolutely false. Some MD5 collision-generators specifically find pairs of equally-lengthed inputs with the same hash. See for example hit #2 for [MD5 collisions]:
'Extension attacks' are something else, which let you turn one collision into more, or create valid hashes for combinations of unknown text plus a chosen extension – not find an initial collision. See:
The 'length extension' property can be helpful, once you find a collision based on 'random' nonsense, in extending that into two documents that are each meaningful-but-different and still colliding, as was done in this 2005 MD5 collision demonstration:
But still possible. Whenever the number of bits in a file is more than the number of bits in a hash, there are collisions. They could use the Chinese Remainder Theorem, but that would only go so far (maybe far enough to remove substantial doubt? The link below seems to suggest so.)
At first, that was what I thought the flaw would be -- providing a file that has a hash that collides with another file, gets you that file.
But it seems to me you would need to know the exact contents of the file in question to get that to happen, making the point moot. Perhaps I'm wrong on that.
> What are the odds of a hash collision + identical filesize?
If implemented correctly, the additional constraint on filesize being the same is irrelevant. Given one particular hash value, the probability that a second file hashes to the same value is 1/(range of hash function) if the hash function is modeled as an ideal hash function.
Given that they do chunks of updates, rather than the whole file, presumably they check multiple hashes to ensure this doesn't happen. I could be wrong, though.