I used to be a developer for some technical websites (publishing articles, blogs, newsletters, that sort of thing), mainly around SQL Server and .NET. Almost all of the authors used Microsoft Word, mainly because it was familiar and had Track Changes. I came to appreciate Track Changes when I wrote or edited the occasional article, and I can't imagine it would be easy to find a replacement that worked smoothly enough to justify convincing all the authors and editors to switch. Fortunately, the authors and editors were pretty good about using the house style in Word, meaning that the articles were semantically marked up. I wrote a converter that did a pretty good job of generating clean HTML from the source Word documents, mainly by mapping Word styles to the corresponding pieces of HTML.
One of the more interesting challenges was the mismatch between the structure of Word docx files and HTML. Whereas HTML has plenty of nesting, docx files have a comparatively flat structure. For instance, suppose you want a nested list of depth 2 that looks like this:
* Outer
  * Inner
One way a docx (XML) file can represent this is a paragraph element with the style "Bullet1" (and the text "Outer"), followed by a paragraph element with the style "Bullet2" (and the text "Inner"). The two elements are completely separate, and you have to infer that they're part of the same list, and that the second bullet is the child of the first. Once you've done that, you can generate the corresponding HTML, which has the inner list as a child of the outer list element.
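To make that inference concrete, here's a minimal sketch in Python. It isn't the code from the converter described above; the `STYLE_DEPTH` mapping and the `paragraphs_to_html` function are hypothetical names made up for illustration, and it assumes the docx has already been reduced to a flat sequence of (style, text) pairs.

```python
from html import escape

# Hypothetical mapping from Word paragraph style names to list depth.
# Any style not listed here is treated as an ordinary paragraph (depth 0).
STYLE_DEPTH = {"Bullet1": 1, "Bullet2": 2}


def paragraphs_to_html(paragraphs):
    """Turn a flat sequence of (style, text) pairs into nested HTML lists."""
    out = []
    depth = 0  # number of <ul> elements currently open
    for style, text in paragraphs:
        target = STYLE_DEPTH.get(style, 0)

        # Shallower than the current item: close the deeper lists,
        # each together with its last open <li>.
        while depth > target:
            out.append("</li></ul>")
            depth -= 1

        if target == 0:
            out.append(f"<p>{escape(text)}</p>")
            continue

        if depth == target:
            # Sibling of the previous item at this level: close that item.
            out.append("</li>")

        # Deeper than the current item: open lists *inside* the open <li>.
        while depth < target:
            out.append("<ul>")
            depth += 1
            if depth < target:
                # A level was skipped, so add an empty wrapper item.
                out.append("<li>")

        # Leave the <li> open in case the next paragraph nests under it.
        out.append(f"<li>{escape(text)}")

    # Close whatever is still open at the end of the document.
    while depth > 0:
        out.append("</li></ul>")
        depth -= 1
    return "".join(out)


print(paragraphs_to_html([("Bullet1", "Outer"), ("Bullet2", "Inner")]))
# prints: <ul><li>Outer<ul><li>Inner</li></ul></li></ul>
```

The key trick is leaving each `<li>` open until the next paragraph shows whether it's a child, a sibling, or something outside the list altogether.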
If anybody's interested, I wrote a Word to HTML converter for node.js using the same ideas: https://github.com/mwilliamson/mammoth.js