Given that the set of valid words in a document is (by definition) a subset of the words in the dictionary, I'd imagine you're better off holding the document in memory, building a data structure (a hash set, say) from its words, and then making a single pass through the dictionary on disk, striking off each word you find. Whatever is left at the end is the set of words in the document that weren't found in the dictionary.
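
Here's a rough Python sketch of that idea, in case it helps. It assumes the dictionary is a plain text file with one word per line and that a simple lowercase letters-and-apostrophes tokenizer is good enough; the function name `unknown_words` and the file names in the usage note are made up for illustration.

```python
import re
from typing import Iterable, Set


def unknown_words(document_text: str, dictionary_lines: Iterable[str]) -> Set[str]:
    """Return the document's words that never appear in the dictionary."""
    # Keep only the document's distinct words in memory (assumed tokenizer).
    remaining = set(re.findall(r"[a-z']+", document_text.lower()))

    # Stream the dictionary once; each line is assumed to hold a single word.
    for line in dictionary_lines:
        remaining.discard(line.strip().lower())
        if not remaining:
            # Every document word has been found, so stop reading early.
            break

    return remaining
```

You could call it with an open file handle, e.g. `with open("dictionary.txt") as f: print(unknown_words(open("essay.txt").read(), f))`, so the dictionary is read line by line and never loaded in full. Memory use then scales with the number of distinct words in the document rather than with the size of the dictionary.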