How to create PDF with scanned pages but selectable text?
This has (contrary to some other answers here) most probably nothing to do with Acrobat at all.
Most (all?!) professional document scanners and most semi-professional ones will automatically perform OCR when you choose "Save as PDF" and have the "searchable" checkbox ticked in the settings. The cheaper "consumer grade" models will do the OCR on the attached PC, whereas typical network scanners do it internally.
The word "searchable" means nothing more and nothing less than that the scanner will perform OCR, then generate a page with the scanned bitmaps within, and overlay them with invisible characters from the OCR, each placed over the respective character on the bitmap.
That way, you can search, and also select, copy, and paste the "bitmap" as if by magic. It's no magic at all, however. In reality, you're just copying invisible text.
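To make the "invisible characters" idea a bit more concrete, here is a minimal sketch of what such a text layer can look like inside the PDF page content. It relies on the PDF text rendering mode 3 ("neither fill nor stroke"), which is how text can be present but not painted. The `OcrWord` type, the helper names and the font resource name `/F1` are made up for illustration, and real OCR tools also stretch each word to match the width of the scanned word, which is left out here.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One recognized word and its position on the page (PDF points, origin bottom-left)."""
    text: str
    x: float
    y: float
    height: float  # reused as the font size in this sketch


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_layer(words: list[OcrWord], font: str = "/F1") -> str:
    """Return content-stream operators that place each word, unpainted, at its position.

    `font` is a hypothetical resource name; the page would have to define it.
    """
    lines = []
    for w in words:
        lines.append("BT")                                  # begin a text object
        lines.append("3 Tr")                                # rendering mode 3: invisible
        lines.append(f"{font} {w.height:g} Tf")             # select font and size
        lines.append(f"{w.x:g} {w.y:g} Td")                 # move to the word's position
        lines.append(f"({escape_pdf_string(w.text)}) Tj")   # "show" the text
        lines.append("ET")                                  # end the text object
    return "\n".join(lines)


if __name__ == "__main__":
    print(invisible_text_layer([OcrWord("Hello", 72, 700, 12)]))
```

The scanned bitmap is drawn as a normal image on the same page; the viewer paints only the image, but search and copy work on the text placed by operators like these.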
The scanner may also do some additional magic such as compositing the large image from many small tiles which are also reused. This results in a much smaller document than would otherwise be possible, but may also lead to funny surprises (not so funny if they happen to you!) such as the "Xerox alters your bills" story, which ironically can happen even when no OCR is done, depending on the firmware.
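As a rough illustration of how reusing tiles can go wrong, here is a toy model. It is not the actual firmware or any real compression format, just a sketch under the assumption that a tile is replaced by an already stored tile whenever the two differ by at most `tolerance` pixels. The glyph bitmaps and function names are invented for the example.

```python
def hamming(a: str, b: str) -> int:
    """Count the pixels that differ between two equally sized tiles."""
    return sum(p != q for p, q in zip(a, b))


def compress(tiles: list[str], tolerance: int) -> tuple[list[str], list[int]]:
    """Store each tile once; reuse a stored tile if it is 'close enough'."""
    dictionary: list[str] = []
    indices: list[int] = []
    for tile in tiles:
        for i, known in enumerate(dictionary):
            if hamming(tile, known) <= tolerance:
                indices.append(i)          # reuse an existing tile
                break
        else:
            dictionary.append(tile)        # no match found: store a new tile
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary: list[str], indices: list[int]) -> list[str]:
    """Rebuild the page from the stored tiles."""
    return [dictionary[i] for i in indices]


# Two 3x5 glyphs, flattened row by row; they differ in a single pixel.
EIGHT = "###" "#.#" "###" "#.#" "###"
SIX = "###" "#.." "###" "#.#" "###"

page = [EIGHT, SIX, EIGHT]

# Exact matching: the page survives unchanged.
assert decompress(*compress(page, tolerance=0)) == page

# Sloppy matching: the 6 is silently replaced by the stored 8.
assert decompress(*compress(page, tolerance=1)) == [EIGHT, EIGHT, EIGHT]
```

With the looser tolerance only one tile is stored instead of two, so the file gets smaller, but a digit on the page has changed without any OCR being involved.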
But how is this possible?
Basically, a program performs OCR on the input file and then it places an invisible layer of text over the picture. Alternatively, it might also place a visible layer of text under the picture, giving the same effect.
When you select something, the picture doesn't matter because the text layer gets selected.
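A simplified way to picture the selection step: the viewer only looks at the boxes of the words in the text layer and never at the pixels. The sketch below assumes a text layer given as words with bounding boxes; `Box`, `selected_text` and the sample coordinates are hypothetical, and real viewers work on glyphs and reading order rather than whole words.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        """True if the two rectangles overlap with a non-empty area."""
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def selected_text(text_layer: list[tuple[str, Box]], selection: Box) -> str:
    """Return the words whose boxes overlap the selection, in layer order."""
    return " ".join(word for word, box in text_layer if box.intersects(selection))


# A made-up text layer, as OCR might produce for one scanned page.
layer = [
    ("Invoice", Box(72, 700, 130, 714)),
    ("No.", Box(136, 700, 160, 714)),
    ("4711", Box(166, 700, 200, 714)),
    ("Total", Box(72, 650, 110, 664)),
]

# Dragging a rectangle over the right part of the first line:
print(selected_text(layer, Box(130, 695, 210, 720)))  # prints: No. 4711
```

The bitmap never appears in this logic, which is why the copied text can differ from what you see if the OCR got a word wrong.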
How can this be created?
There are several ways. Given that Acrobat has already been suggested, I will add some free options (and luckily you are not forced to have Windows to use them).
PDF-XChange Viewer
This is a native Windows program by Tracker Software. The freeware version runs fine under Wine if you use the 32-bit edition in a 32-bit prefix, so you can use it on Windows, macOS and Linux. In the last two cases, you would need PlayOnMac or PlayOnLinux respectively.
Here's a picture from this answer I left on Ask Ubuntu:
OCRmyPDF
This is a multiplatform program written in Python, based on Ghostscript, Tesseract and Unpaper. From the docs:
What OCRmyPDF does
OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content. It uses Ghostscript to rasterize the page, and then performs OCR on the rasterized image to create an OCR “layer”. The layer is then grafted back onto the original PDF.
It can be easily installed on Debian and Ubuntu derivatives:
apt-get install ocrmypdf
Or on macOS:
brew tap jbarlow83/ocrmypdf
brew install ocrmypdf
On Windows you would need to use the Docker image. See the official docs for details.
Usage is very simple and I suggest you use the optional -d (deskew) and -c (clean) parameters for better results. It will straighten every page and clean up small dots/imperfections before running the OCR process.
You can (and should) provide the language with -l.
Here's an example taken from this skewed document written in Italian:
The command I used was:
ocrmypdf -l ita -d -c input.pdf output.pdf
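If you want to run the same command from a script, for example over a folder of scans, a small wrapper along these lines should do. This is only a sketch: the function names are mine, it assumes the ocrmypdf executable is on your PATH, and it uses nothing beyond the -l, -d and -c options shown above.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src: str, dst: str, language: str = "ita",
                           deskew: bool = True, clean: bool = True) -> list[str]:
    """Assemble the ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf", "-l", language]
    if deskew:
        cmd.append("-d")   # straighten skewed pages
    if clean:
        cmd.append("-c")   # clean up small dots before OCR
    cmd += [src, dst]
    return cmd


def run_ocr(src: str, dst: str, **options) -> None:
    """Run ocrmypdf on one file; raises if the tool is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)


if __name__ == "__main__":
    # Shows the command that would be run, without running it.
    print(" ".join(build_ocrmypdf_command("input.pdf", "output.pdf")))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.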
Online tools
There are a few online tools that do the same. Notably, PDF24 hosts a free web-based version of OCRmyPDF that can be used without limitations.
See also:
- ocr.space
- Cvision online OCR
- LeadTools JS based demo with OCR
This is possibly because of an Acrobat OCR feature:
Acrobat can recognize text in any PDF or image file in dozens of languages. All you have to do is open the scanned document or image that you'd like to OCR, then click the blue Tools button in the top right of the toolbar. In that sidebar, select the Recognize Text tab, then click the In This File button.
...
With the text recognized, you can now markup the PDF using all the normal markup tools — you can highlight, cross out text, and more. You can even copy the text with the detected formatting, though that's often less accurate than the text recognition itself.