Best way to extract text from a Word doc without using COM/automation?
I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped them in Python functions, so they are easy to call from the parsing system (which is written in Python).
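import subprocess

def doc_to_text_catdoc(filename):
    # Pass the arguments as a list so the filename needs no shell quoting.
    proc = subprocess.Popen(['catdoc', '-w', filename],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads both pipes together and waits for catdoc to exit.
    retval, erroroutput = proc.communicate()
    if not erroroutput:
        return retval
    else:
        raise OSError("Executing the command caused an error: %s" % erroroutput)

# similar doc_to_text_antiword()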
The -w switch to catdoc turns off line wrapping, BTW.
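For the antiword variant mentioned in the code comment, something along these lines should work. This is only a sketch: `run_extractor` is a hypothetical helper name, not part of any library, and the `-w 0` option (no wrapping within a paragraph) is taken from the antiword man page, so check it against your installed version.

```python
import subprocess

def run_extractor(argv):
    """Run a text-extraction command and return what it wrote to stdout.

    Hypothetical helper: follows the same convention as the catdoc
    wrapper, treating any output on stderr as a failure.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    output, erroroutput = proc.communicate()
    if erroroutput:
        raise OSError("Executing the command caused an error: %s" % erroroutput)
    return output

def doc_to_text_antiword(filename):
    # Assumption: "-w 0" keeps each paragraph on one line (see antiword(1)).
    return run_extractor(['antiword', '-w', '0', filename])
```

Because the command is passed as a list, the same helper can wrap catdoc too, which makes it easy to try both tools on a file and keep whichever output parses more cleanly.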
(Same answer as extracting text from MS word files in python)
Use the native Python docx module, which I made this week. Here's how to extract all the text from a .docx document:
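# Assumes the docx module described in this answer is installed and
# exposes these names at module level.
from docx import opendocx, getdocumenttext, wordnamespaces

document = opendocx('Hello world.docx')
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]
# Extract all text
print(getdocumenttext(document))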
See Python DocX site
100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes.
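To illustrate what the `/w:document/w:body` location refers to: a .docx file is a zip archive whose main content is stored in `word/document.xml`. The sketch below reads that part with only the standard library. It is not how the docx module is implemented, just a minimal illustration; `docx_paragraphs` and `W_NS` are names made up for this example, and the code ignores tabs, line breaks, headers, and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, i.e. what the w: prefix stands for.
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def docx_paragraphs(source):
    """Return the text of each paragraph in the body of a .docx file.

    `source` may be a path or a file-like object opened in binary mode.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    # Equivalent of the /w:document/w:body lookup.
    body = root.find('{%s}body' % W_NS)
    paragraphs = []
    for para in body.iter('{%s}p' % W_NS):
        # Visible text lives in w:t elements inside each paragraph's runs.
        text = ''.join(t.text or '' for t in para.iter('{%s}t' % W_NS))
        paragraphs.append(text)
    return paragraphs
```

Usage would be along the lines of `print('\n'.join(docx_paragraphs('Hello world.docx')))`. Note that this only applies to .docx files; the older binary .doc format still needs a tool such as catdoc or antiword.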