Is there a Python module for converting RTF to plain text?
I've been working on a library called Pyth, which can do this:
http://pypi.python.org/pypi/pyth/
Converting an RTF file to plaintext looks something like this:
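from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
# Parse the RTF file into Pyth's document model (binary mode avoids newline translation)
with open('sample.rtf', 'rb') as f:
    doc = Rtf15Reader.read(f)
# Render the document as plain text and print it
print(PlaintextWriter.write(doc).getvalue())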
Pyth can also generate RTF files, read and write XHTML, generate documents from Python markup a la Nevow's stan, and has limited experimental support for LaTeX and PDF output. Its RTF support is pretty robust -- we use it in production to read RTF files generated by various versions of Word, OpenOffice, Mac TextEdit, EIOffice, and others.
OpenOffice has an RTF reader. You can use Python to script OpenOffice; see here for more info.
You could probably try using the magic COM object on Windows to read anything that smells like an MS binary format. I wouldn't recommend that, though.
Actually parsing the raw data probably won't be very hard, see this example written in .bat/QBasic.
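To give a feel for what "parsing the raw data" involves, here is a minimal Python sketch (not the linked example's method). It assumes simple RTF: it drops control words, skips a few metadata groups, and decodes \'hh escapes as cp1252. The names rtf_to_text and SKIP_DESTINATIONS are made up for this illustration, the destination list is partial, and \uN Unicode escapes are not handled, so treat it as a starting point rather than a real converter.

```python
import re

# Destinations whose contents are metadata rather than body text (partial list, an assumption).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"    # control word with optional number, e.g. \par or \fs24
    r"|\\'([0-9a-fA-F]{2})"       # hex-escaped byte, e.g. \'e9
    r"|\\([^a-zA-Z])"             # control symbol, e.g. \{ or \*
    r"|([{}])"                    # group delimiters
    r"|([^\\{}]+)"                # run of plain text
)


def rtf_to_text(rtf, encoding="cp1252"):
    """Return a rough plain-text rendering of a simple RTF string."""
    out = []
    stack = []        # saved 'skipping' flags of the enclosing groups
    skipping = False  # True while inside a metadata destination
    for match in TOKEN.finditer(rtf):
        word, _param, hexbyte, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif word:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif not skipping and word in ("par", "line"):
                out.append("\n")
            elif not skipping and word == "tab":
                out.append("\t")
        elif symbol:
            if symbol == "*":
                skipping = True  # \* marks a destination that readers may ignore
            elif not skipping and symbol in "{}\\":
                out.append(symbol)  # escaped literal brace or backslash
        elif hexbyte and not skipping:
            out.append(bytes([int(hexbyte, 16)]).decode(encoding, errors="replace"))
        elif text and not skipping:
            # Raw line breaks in RTF source carry no meaning.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)


sample = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0\fs24 Hello, \b world\b0 !\par}"
print(rtf_to_text(sample))
```

For anything produced by a real word processor (tables, embedded objects, Unicode text), a tested library such as Pyth is likely the safer choice.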
DocFrac is a free, open source converter between RTF, HTML and text. Windows, Linux, ActiveX and DLL versions are available. It will probably be pretty easy to wrap it up in Python.
RTF::TEXT::Converter - Perl extension for converting RTF into text (in case you have problems with DocFrac).
Official Rich Text Format (RTF) Specifications, version 1.7, by Microsoft.
Good luck (with the limited privileges in your working environment).
If you are on a Mac, you can convert an RTF file file.rtf to TXT from the CLI like:
textutil -convert txt file.rtf
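Since the question asks about Python, the same command can be driven from a script. This is a small sketch using only the standard library; the helper names textutil_command and convert_with_textutil are hypothetical, and it assumes textutil is on the PATH (macOS only).

```python
import subprocess
import sys


def textutil_command(rtf_path):
    """Build the argument list for the textutil call shown above."""
    return ["textutil", "-convert", "txt", rtf_path]


def convert_with_textutil(rtf_path):
    """Run textutil on rtf_path; raises CalledProcessError if the tool fails."""
    if sys.platform != "darwin":
        raise RuntimeError("textutil is only available on macOS")
    subprocess.run(textutil_command(rtf_path), check=True)
```

Calling convert_with_textutil("file.rtf") should leave file.txt next to the input file, the same as running the command by hand.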