Python: convert Microsoft Office docs to plain text on Linux

The usual tool for converting Microsoft Office documents to HTML or other formats was mswordview, which has since been renamed to wvWare.

If you're looking for a command-line tool, the wvWare developers actually recommend using AbiWord to perform the conversion:

abiword --to=txt

If you're looking for a library, start on the wvWare overview page. They also maintain a list of libraries and tools which read MS Office documents.


I'd go for the command-line solution (and then use the Python subprocess module to run the tools from Python).

Converters for MS Word (catdoc), Excel (xls2csv) and PowerPoint (catppt) can be found (in source form) here: http://vitus.wagner.pp.ru/software/catdoc/.

Can't really comment on the usefulness of catppt, but catdoc and xls2csv work great!

But be sure to search your distribution's repositories first... On Ubuntu, for example, catdoc is just one quick apt-get away.
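
A rough sketch of the subprocess approach, assuming the catdoc tools are installed and on your PATH and that each one prints the converted text to standard output. The names CONVERTERS, converter_command and to_text are made up for this example, and the extension-to-tool mapping is just my guess at a sensible default:

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to the catdoc-suite tool
# that handles it. Adjust to whatever is installed on your system.
CONVERTERS = {
    ".doc": "catdoc",
    ".xls": "xls2csv",
    ".ppt": "catppt",
}


def converter_command(path):
    """Return the argv list for converting `path`, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        tool = CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"no converter known for {suffix!r}") from None
    return [tool, str(path)]


def to_text(path):
    """Run the converter and return whatever it wrote to stdout."""
    result = subprocess.run(
        converter_command(path),
        capture_output=True,  # needs Python 3.7+
        text=True,            # decode output using the locale encoding
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    import sys

    # Usage: python office2txt.py file.doc
    print(to_text(sys.argv[1]))
```

Passing the command as a list (rather than a shell string) means file names with spaces or quotes don't need escaping. If the tool is missing, subprocess.run raises FileNotFoundError, which you may want to catch and turn into a friendlier message.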


You can access OpenOffice via its Python API.

Try using this as a base: http://wiki.services.openoffice.org/wiki/Odt2txt.py