Tools to extract text from PowerPoint pptx in Linux?
If you can process the files in bash, this one-liner will unpack all the text:
unzip -qc "$1" 'ppt/slides/slide*.xml' | grep -oP '(?<=<a:t>).*?(?=</a:t>)'
Just pass it the pptx file as $1. The text goes to standard output, so redirect it to a file if you want to keep it. The slides will not necessarily appear in presentation order, and there will be no labels or anything, so you'll need a few more lines of script (and perhaps a temp directory) to get a more readable listing.
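For what it's worth, the "few more lines of script" could look something like the sketch below. The function name pptx_text is made up for illustration. It assumes the slides are stored as ppt/slides/slideN.xml and that N matches the presentation order, which is usually but not always true: the authoritative order lives in ppt/presentation.xml. It also assumes GNU grep (for -P) and Info-ZIP unzip. It streams each slide separately, so no temp directory is needed. XML entities such as &amp;amp; are left undecoded.

```shell
#!/usr/bin/env bash
# pptx_text FILE.pptx
# Print the text runs of each slide under a "Slide N" label.
# Assumption: the slideN.xml numbering reflects presentation order.
pptx_text() {
    local pptx=$1 n
    # List the archive's slide files, keep only the slide numbers,
    # and sort them numerically so slide10 comes after slide2.
    unzip -Z1 "$pptx" 'ppt/slides/slide*.xml' |
        sed -n 's|^ppt/slides/slide\([0-9][0-9]*\)\.xml$|\1|p' |
        sort -n |
        while read -r n; do
            printf '=== Slide %s ===\n' "$n"
            # Extract whatever sits between <a:t> and </a:t> on this slide.
            unzip -p "$pptx" "ppt/slides/slide$n.xml" |
                grep -oP '(?<=<a:t>).*?(?=</a:t>)'
            echo
        done
}
```

Usage would be something like: pptx_text talk.pptx > talk.txt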
Since you have AbiWord installed, you can just make a PDF first:
libreoffice --headless --convert-to pdf filename.pptx
And then use AbiWord to convert the PDF to text:
abiword --to=txt filename.pdf