How to turn a PDF into a text-searchable PDF?
As of Ubuntu 16.04, OCRmyPDF is available through apt. Just run:
sudo apt install ocrmypdf
ocrmypdf -h # to see the usage
Then you can OCR your PDF with the command:
ocrmypdf input.pdf output.pdf # change input and output to the files you want
If the command seems unresponsive, you can increase the verbosity with the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter PDF. You can shorten a PDF, for example to its first five pages, as follows:
pdftk A=input.pdf cat A1-5 output output.pdf
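The trial run described above can be wrapped in a small shell function. This is only a sketch: ocr_trial is a made-up helper name, the default of five pages simply mirrors the example, and it assumes the input has at least that many pages. The pdftk and ocrmypdf calls are the same ones shown above.

```shell
# Hypothetical helper: OCR only the first N pages of a PDF as a quick test.
# Usage: ocr_trial input.pdf [pages]
# Assumes pdftk and ocrmypdf are installed and the PDF has at least N pages.
ocr_trial() (
    input=$1
    pages=${2:-5}                        # number of leading pages to try
    tmpdir=$(mktemp -d) || exit 1        # scratch space for the shortened PDF
    sample="$tmpdir/sample.pdf"
    result="${input%.pdf}_trial.pdf"     # e.g. book.pdf -> book_trial.pdf

    # Cut out the leading pages, then OCR only that sample.
    pdftk A="$input" cat "A1-$pages" output "$sample" &&
        ocrmypdf "$sample" "$result"
    status=$?

    rm -rf "$tmpdir"
    [ "$status" -eq 0 ] && echo "$result"   # print the result path on success
    exit "$status"
)
```

Calling ocr_trial input.pdf 3 would leave the original untouched and write input_trial.pdf next to it, so you can judge the OCR quality before processing the whole document.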
If you have any questions, have a look at the project's new GitHub repository.
@don.joey answered with the ocrmypdf script. However, it can now be installed directly (from Ubuntu 16.10 onwards):
sudo apt install ocrmypdf
Then you have to install the tesseract languages you need.
To list which languages are already in your system, type:
tesseract --list-langs
If the one you need is missing, install it. For instance, for Spanish:
sudo apt install tesseract-ocr-spa
Now you can produce a searchable PDF (its quality will depend on the scanned document) with the following command:
ocrmypdf -l 'spa' old.pdf new.pdf
You can, of course, check its man page for some additional options.
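The language check and the OCR step can be combined so that a missing language pack is caught before ocrmypdf starts. The following is a rough sketch; has_tess_lang and ocr_with_lang are hypothetical helper names, and the suggested package name only follows the tesseract-ocr-spa pattern from the example above, which may not hold for every language code.

```shell
# Hypothetical helper: succeed if the given tesseract language code is installed.
has_tess_lang() {
    # --list-langs prints a header line, then one language code per line.
    # Some older tesseract releases print the list on stderr, hence 2>&1.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# Hypothetical helper: run ocrmypdf only if the language is available.
# Usage: ocr_with_lang spa old.pdf new.pdf
ocr_with_lang() (
    lang=$1
    if ! has_tess_lang "$lang"; then
        # Package name is a guess based on the tesseract-ocr-spa example.
        echo "Language '$lang' not found; try: sudo apt install tesseract-ocr-$lang" >&2
        exit 1
    fi
    ocrmypdf -l "$lang" "$2" "$3"
)
```

With these defined, ocr_with_lang spa old.pdf new.pdf runs the same command as above, but stops with a hint if Spanish is not installed.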
pdfsandwich performs exactly this job. I wasn't aware that a package is provided in the software center, but I provide Ubuntu deb packages for it on the project website (see http://www.tobias-elze.de/pdfsandwich/ for details), including the currently most recent version (0.1.2), which is unlikely to be in any software center yet.
If you have a scanned file scanned_file.pdf, simply call
pdfsandwich scanned_file.pdf
which generates the file scanned_file_ocr.pdf with the recognized text added to the scanned pages.
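If you have a whole folder of scans, the same call can be looped. Here is a minimal sketch that relies only on the _ocr.pdf naming described above; sandwich_output_name and sandwich_dir are hypothetical helper names, not part of pdfsandwich.

```shell
# Hypothetical helper: the output name described above,
# e.g. scanned_file.pdf -> scanned_file_ocr.pdf
sandwich_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: run pdfsandwich on every PDF in a directory
# (default: the current one) that has no *_ocr.pdf counterpart yet.
sandwich_dir() (
    cd "${1:-.}" || exit 1
    for pdf in *.pdf; do
        [ -e "$pdf" ] || continue                  # glob matched nothing
        case $pdf in *_ocr.pdf) continue ;; esac   # skip earlier results
        [ -e "$(sandwich_output_name "$pdf")" ] && continue   # already done
        pdfsandwich "$pdf"
    done
)
```

Running sandwich_dir ~/scans a second time should then only process files added since the last run.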
Unlike most existing solutions, it autodetects the installed tesseract version and adapts its behavior accordingly. In addition, it preprocesses the scanned images before OCR, for example by de-skewing them or removing dark edges, which can considerably improve character recognition.
DISCLAIMER: I'm the developer of pdfsandwich and therefore heavily biased.