Is it possible to parse MS Word using Apache POI and convert it into XML?

The purpose of the HWPF subproject is exactly that: processing Word files.

http://poi.apache.org/hwpf/index.html

Then, to convert the data to XML, you have to build the XML in one of the usual ways: StAX, JDOM, XStream...
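As a rough illustration of the StAX route, here is a minimal sketch that uses only the JDK. It assumes you already have the paragraph strings, for example from HWPF's `WordExtractor.getParagraphText()`. The class name `ParagraphsToXml` and the `<document>`/`<p>` element names are made up for this example, not something POI prescribes.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class ParagraphsToXml {

    // Writes each non-empty paragraph as a <p> element inside a <document> root.
    // The element names are an assumption for this sketch; pick your own schema.
    public static String toXml(List<String> paragraphs) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("document");
            for (String paragraph : paragraphs) {
                // Paragraph text extracted from Word usually carries a trailing line break.
                String text = paragraph.trim();
                if (text.isEmpty()) {
                    continue;
                }
                xml.writeStartElement("p");
                xml.writeCharacters(text); // escapes & and < for us
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; in practice these strings would come from the Word file.
        System.out.println(toXml(Arrays.asList("First paragraph\r", "Tom & Jerry\r")));
    }
}
```

Keeping the XML writing separate from the Word parsing means the same method works whichever extractor ends up supplying the text.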

Apache offers a Quick Guide:

http://poi.apache.org/hwpf/quick-guide.html

I have also found this tutorial:

http://sanjaal.com/java/tag/simple-java-tutorial-to-read-microsoft-document-in-java/

If you want to process .docx files, you might want to look at the OpenXML4J subproject, which POI's XWPF component for Word 2007+ files is built on:

http://poi.apache.org/oxml4j/index.html

I'd say you have two options, both powered by Apache POI.

One is to use Apache Tika. Tika is a text and metadata extraction toolkit, and it can extract fairly rich text from Word documents by making the appropriate calls to POI. The result is that Tika will give you XHTML-style XML for the contents of your Word document.
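A minimal sketch of the Tika route might look like the following. It assumes the Tika parsers (and so POI) are on the classpath; the class name `TikaWordToXhtml` and the command-line file argument are only for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

public class TikaWordToXhtml {

    // Parses the file at the given path and returns Tika's XHTML output as a string.
    public static String toXhtml(String path) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();          // detects .doc/.docx itself
        ToXMLContentHandler handler = new ToXMLContentHandler();   // collects SAX events as XML text
        Metadata metadata = new Metadata();                        // filled with title, author, etc.
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Pass the path to a Word document as the first argument.
        System.out.println(toXhtml(args[0]));
    }
}
```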

The other option is to use WordToHtmlConverter, a class that was added to POI fairly recently. This will turn your Word document into HTML for you, and will generally preserve slightly more of the structure and formatting than Tika will.
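For the WordToHtmlConverter route, something along these lines should work for .doc files. It is a sketch that assumes the POI scratchpad jar (which contains HWPF) is on the classpath; the class name `DocToHtml` and the command-line file argument are placeholders.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

public class DocToHtml {

    // Converts a binary .doc file to an HTML string via a DOM document.
    public static String toHtml(String path) throws Exception {
        // The converter builds its output into an empty DOM document that we supply.
        Document target = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter converter = new WordToHtmlConverter(target);
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            converter.processDocument(new HWPFDocument(in));
        }

        // Serialize the resulting DOM with the standard JAXP transformer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(converter.getDocument()), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Pass the path to a .doc file as the first argument.
        System.out.println(toHtml(args[0]));
    }
}
```

Because the result is a DOM document before it is serialized, switching the output method to "xml" or post-processing the tree is an option if you need well-formed XML rather than HTML.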

Depending on the kind of XML you're hoping to get out, one of these should be a good bet for you. I'd suggest you try both against some of your sample files, and see which one is the best fit for your problem domain and needs.


