Is there a simple way to identify if a PDF is scanned?
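- Put all the .pdf files in one folder.
- In a terminal, change directory to that folder with
cd <path to dir>
- Make one more directory for the non-scanned files. Example:
mkdir ./x
- Run the following loop. Giving - as the output file makes pdftotext write the extracted text to standard output, so no .txt files are created and nothing needs to be cleaned up afterwards:
for file in *.pdf; do
  if [ -n "$(pdftotext "$file" - 2>/dev/null | tr -d '[:space:]')" ]; then mv "$file" ./x; fi
done
All the scanned PDF files (those with no extractable text) will remain in the folder, and the other files will be moved to ./x.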
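The same check can be sketched in Node.js for readers who prefer a script over a shell loop. This is only a minimal sketch under stated assumptions: it assumes the pdftotext binary (from poppler-utils) is on the PATH, and the helper names hasExtractableText and findScannedPdfs are made up for illustration. pdftotext emits a form feed for each page even when the page has no text, so whitespace is stripped before testing for emptiness.

```javascript
const { execFileSync } = require("child_process");
const fs = require("fs");
const path = require("path");

// True if the extracted text contains anything besides whitespace.
// \s also matches the form feed (\f) that pdftotext emits per page.
function hasExtractableText(text) {
  return text.replace(/\s+/g, "").length > 0;
}

// Hypothetical helper: list the PDFs in `dir` that yield no text at all.
// Assumes pdftotext is installed; "-" sends the extracted text to stdout.
// execFileSync throws if pdftotext exits with an error.
function findScannedPdfs(dir) {
  return fs.readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith(".pdf"))
    .filter((name) => {
      const text = execFileSync("pdftotext", [path.join(dir, name), "-"], {
        encoding: "utf8",
        maxBuffer: 64 * 1024 * 1024,
      });
      return !hasExtractableText(text);
    });
}
```

Calling findScannedPdfs(".") would return the names of the files that look scanned. A scanned file that has been run through OCR contains text and would not be listed.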
Shellscript
If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/. In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.
#!/bin/bash

echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
  exit
fi

mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"

# do not run the loop on the literal string '*.pdf' when there are no pdf files
shopt -s nullglob

for file in *.pdf
do
  # -a: treat the binary pdf file as text, -q: only set the exit status
  if grep -aq '/Image/' "$file"
  then
    image=true
  else
    image=false
  fi
  if grep -aq '/Text' "$file"
  then
    text=true
  else
    text=false
  fi

  # sort the file according to which strings were found
  if $image && $text
  then
    mv "$file" "s-and-t"
  elif $image
  then
    mv "$file" "scanned"
  elif $text
  then
    mv "$file" "text"
  else
    echo "$file undecided"
  fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf files and run the shellscript.
Identified files are moved to the following subdirectories:
- scanned
- text
- s-and-t (for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
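The decision table used by the shellscript can be sketched in Node.js as follows. This is only an illustration: the function name classifyByMarkers is made up here, and the returned labels simply reuse the script's directory names. As with grep, only the raw bytes are searched, so a marker that sits inside a compressed object stream (possible in PDF 1.5 and later) would be missed.

```javascript
// Markers the shellscript greps for, searched in the raw file bytes.
const IMAGE_MARKER = Buffer.from("/Image/", "latin1");
const TEXT_MARKER = Buffer.from("/Text", "latin1");

// Returns "s-and-t", "scanned", "text" or "undecided",
// mirroring the branches of the shellscript.
function classifyByMarkers(pdfBytes) {
  const hasImage = pdfBytes.includes(IMAGE_MARKER);
  const hasText = pdfBytes.includes(TEXT_MARKER);
  if (hasImage && hasText) return "s-and-t";
  if (hasImage) return "scanned";
  if (hasText) return "text";
  return "undecided";
}

// Example with synthetic byte strings instead of real files:
console.log(classifyByMarkers(Buffer.from("<</Subtype/Image/Width 10>>"))); // scanned
console.log(classifyByMarkers(Buffer.from("<</ProcSet[/PDF/Text]>>")));     // text
```

For a real file, the bytes could be read with fs.readFileSync and passed to the function in the same way.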
Test
I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some pdf files of my own (which I created using LibreOffice Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
If the question is whether a PDF was actually created by scanning, rather than whether it contains images instead of text, you might need to dig into the file's metadata, not just its content.
In general, for the files I could find on my computer and your test files, the following is true:
- Scanned files have fewer than 1000 characters per page, whereas non-scanned ones always have more than 1000 characters per page.
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referring to Canon scanner software.
- PDFs with "Microsoft Word" as creator are unlikely to be scanned, as they are Word exports. But someone could scan to Word and then export to PDF; some people have very strange workflows.
I'm using Windows at the moment, so I used Node.js for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;

// debug() prints to stderr only when the debug flag is given
const debug = DEBUG ? console.error : () => { };

(async () => {
    const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });

    for (let i = 0, l = pdfs.length; i < l; ++i) {
        const pdffilename = pdfs[i];
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);

            // make sure the fields used below always exist
            if (!data.info)
                data.info = {};
            if (!data.metadata) {
                data.metadata = {
                    _metadata: {}
                };
            }

            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / (data.numpages);
            debug("Text length: ", textLen);
            debug("Chars per page: ", textPerPage);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);

            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            else {
                debug("Failed to evaluate " + pdffilename);
                debug(e.stack);
            }
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();

const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;

// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text pdf
        return IS_SCANNED;
    }
    // definitely has enough text
    // might be scanned but OCRed
    // we return this if no
    // suspicion of scanning is found
    let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;

    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // this is always scanned, canon is brand name
        return IS_SCANNED;
    }
    return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned shows the PDFs thought to be scanned (otherwise the ones thought not to be scanned are shown)
- debug shows debug info such as metadata and error stack traces
- strict kills the program on the first error
This example is not a finished solution, but with the debug flag you get some insight into the meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03\'00\'',
ModDate: 'D:20140709104225-03\'00\'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:\web\so-odpovedi\pdf\BR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.
D:\xxxx\pdf>node detect_scanned.js scanned
D:\xxxx\pdf\AR-G1002-scanned.pdf
D:\xxxx\pdf\AR-G1002_scanned.pdf
D:\xxxx\pdf\BR-L1411-3-scanned.pdf
D:\xxxx\pdf\WHO_TRS_696-scanned.pdf
D:\xxxx\pdf>node detect_scanned.js
D:\xxxx\pdf\AR-G1003-not-scanned.pdf
D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf
D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs; it always prints one full path per line.
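As one example of such a tweak, the creator check could also use the IS_CREATOR_MS_WORD pattern, which the program defines but never applies. The sketch below is a hypothetical variant, not part of the original program: the name evalScannedWithCreator is made up, the thresholds are the ones from evalScanned, and treating a Word export as not scanned is only the rule of thumb described above, so it will misjudge documents that were scanned into Word first.

```javascript
const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;

// Hypothetical variant of evalScanned that also consults the MS Word pattern.
// Returns true when the PDF is thought to be scanned.
function evalScannedWithCreator(info, numpages, textLen) {
  const creator = (info && info.Creator) || "";
  const textPerPage = numpages > 0 ? textLen / numpages : 0;

  if (IS_CREATOR_CANON.test(creator)) {
    return true;   // scanner software named as creator
  }
  if (textPerPage < 300 && numpages > 1) {
    return true;   // almost no text on a multi-page document
  }
  if (IS_CREATOR_MS_WORD.test(creator)) {
    return false;  // assumed to be a direct export from Word
  }
  // fall back to the 1000 chars/page rule of thumb
  return textPerPage <= 1000;
}

// Illustrative values: a Canon-created file with 772 characters on 386 pages.
console.log(evalScannedWithCreator({ Creator: "Canon " }, 386, 772)); // true
```

The function takes the info object, page count and text length separately, so it can be tried on values copied from the debug output without parsing a file again.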