How to know if a PDF contains only images or has been OCR scanned for searching?
Scanned images that were converted to PDF and OCR'ed afterwards to make the text searchable normally contain the text rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image. But when a search succeeds, the highlighted hits are on the invisible text.
I'd recommend you look at the XPDF-derived command-line tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html
Example usage of pdffonts:
C:\downloads\> pdffonts cisco-ip-phone-7911-guide6.1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
LGOKFL+Univers-BlackOblique          Type 1C           yes yes no   13171  0
LGOKGM+Univers-Black                 Type 1C           yes yes no   13172  0
[....]
This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).
C:\downloads\> pdffonts examle1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Univers-BlackOblique                 Type 1C           yes no  no      14  0
Arial                                TrueType          no  no  no      15  0
This PDF uses 2 fonts (indicated by the 'name' column). The font 'Univers-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.
C:\downloads\> pdffonts examle2.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
This PDF does not use a single font, and hence does not have any text embedded (so no OCR either).
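If you need to run this font check over many files, it can be scripted. The following is only a minimal Python sketch: it assumes pdffonts is on the PATH and prints the header plus dashed separator line shown in the examples above. The function names count_fonts and pdf_font_count are made up for illustration.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows that follow the dashed separator line."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("---"):
            # Every non-blank line after the separator describes one font.
            return sum(1 for row in lines[i + 1:] if row.strip())
    # No separator found: treat the output as listing no fonts.
    return 0


def pdf_font_count(path: str) -> int:
    """Run pdffonts on a file (assumed to be on the PATH) and count its fonts."""
    result = subprocess.run(
        ["pdffonts", path],
        capture_output=True,
        text=True,
        errors="replace",  # font names may not decode cleanly
        check=True,
    )
    return count_fonts(result.stdout)
```

A count of 0 corresponds to the image-only case shown above. A nonzero count only tells you that the PDF contains text objects, not whether they came from OCR.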
Example usage of pdftotext:
C:\downloads\> pdftotext ^
-layout ^
cisco-ip-phone-7911-guide6.1.pdf ^
cisco-ip-phone-7911-guide6.1.txt
This will extract all text strings from the PDF (trying to preserve some semblance of the original layout). If there is no text in the PDF, you'd know there was no OCR...
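The same "is there any text at all?" test can be automated. This is a rough Python sketch, assuming pdftotext is on the PATH; it passes "-" as the output file so the text goes to stdout instead of a .txt file. The function names has_extractable_text and pdf_has_text are hypothetical. The output is handled as raw bytes because the encoding pdftotext uses depends on the build and its options.

```python
import subprocess


def has_extractable_text(extracted: bytes) -> bool:
    """True if the extracted output contains anything besides whitespace.

    pdftotext separates pages with form feeds, which bytes.strip() removes
    along with spaces and newlines, so empty pages do not count as text.
    """
    return bool(extracted.strip())


def pdf_has_text(path: str) -> bool:
    """Run pdftotext on a file (assumed to be on the PATH) and test its output."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    return has_extractable_text(result.stdout)
```

If pdf_has_text returns False for a file, nothing could be extracted, which matches the "no OCR" case described above.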