Convert Word doc to HTML programmatically in Java

If its a docx, you could use docx4j (ASL v2). This uses XSLT to create the HTML.

However, it will give you a single HTML for the whole document.

If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).


I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

You'd probably want to start openoffice up once at startup of your program, and call the python script as many times during your program that you need to (with some sort of checking to ensure the ooffice process is always there).

The other option is to spawn the following sort of command every time you need to do the conversion:

ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).

While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.

I've used abiword before too, which works well for many documents, but does get confused with more complex documents (ooffice seems to handle everything I've thrown at it). Abiword has a slightly easier command line interface for conversion than ooffice.


I recommend the JODConverter, It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.

JODConverter has a lot of documents, scripts, and tutorials to help you out.


We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.

Tags:

Html

Java

Ms Word