How can I distinguish a digitally-created PDF from a searchable PDF?
With PyMuPDF you can easily remove all text, as @ypnos' suggestion requires.
As an alternative, PyMuPDF also lets you check whether the text in a PDF is hidden. In PDF's content-stream "mini-language", hidden text is triggered by the command 3 Tr ("text render mode"; see e.g. page 402 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf).
So if all text is under the influence of this command, then none of it will be rendered - allowing the conclusion "this is an OCR'ed page".
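The check described above can be sketched without any PDF library, working directly on a decompressed page content stream. This is a simplified illustration, not PyMuPDF's own method: the function name all_text_hidden is hypothetical, only the Tj and TJ text-showing operators are considered, nested parentheses in strings are not handled, and q/Q graphics-state save/restore is ignored.

```python
import re

# (...) string literals; a simplification that ignores nested parentheses
STRING_LITERAL = re.compile(rb"\((?:\\.|[^\\()])*\)")
# Either "<n> Tr" (set text render mode) or a Tj/TJ text-showing operator
OPERATORS = re.compile(rb"(\d+)\s+Tr\b|\b(Tj|TJ)\b")


def all_text_hidden(content: bytes) -> bool:
    """Return True if every Tj/TJ in the stream runs under render mode 3."""
    # Blank out string contents so text like "(3 Tr)" is not read as an operator
    stripped = STRING_LITERAL.sub(b"()", content)
    mode = 0   # PDF default render mode: fill
    shown = 0  # number of text-showing operators seen
    for match in OPERATORS.finditer(stripped):
        if match.group(1) is not None:
            mode = int(match.group(1))  # a new render mode takes effect
        else:
            shown += 1
            if mode != 3:               # visible text found
                return False
    return shown > 0                    # no text at all is not "hidden text"


hidden = b"BT /F1 12 Tf 3 Tr 72 700 Td (scanned words) Tj ET"
visible = b"BT /F1 12 Tf 72 700 Td (typed words) Tj ET"
print(all_text_hidden(hidden), all_text_hidden(visible))
```

With PyMuPDF, the bytes to pass in could come from page.read_contents(), assuming that method is available in your installed version.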
I modified this answer from one given for the question "How to check if PDF is scanned image or contains text".
In this solution you don't have to render the PDF, so I would guess it is faster. The answer I modified used the percentage of the page area covered by text to decide whether a document is a text document or a scanned document (image).
I added similar reasoning for images: the code sums the area covered by images to get the percentage of the page they cover. If the page is mostly covered by images, you can assume it is a scanned document. You can adjust the thresholds to fit your document collection.
I also added logic to check page by page, because at least in my document collection some documents have a digitally created first page while the rest is scanned.
Modified code:
import fitz  # pip install PyMuPDF


def page_type(page):
    """Classify a page as "Scanned", "Searchable text" or "Digitally created"."""
    page_area = abs(page.rect)  # Total page area

    # Sum the area of all image blocks (get_text was getText in older PyMuPDF)
    img_area = 0.0
    for block in page.get_text("rawdict")["blocks"]:
        if block["type"] == 1:  # Type 1 blocks are images
            bbox = block["bbox"]
            img_area += (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # width * height
    img_perc = img_area / page_area
    print("Image area proportion: " + str(img_perc))

    # Sum the area of all text blocks (was page.getTextBlocks() in older PyMuPDF)
    text_area = 0.0
    for b in page.get_text("blocks"):
        if b[6] != 0:  # Skip anything that is not a text block
            continue
        r = fitz.Rect(b[:4])  # Rectangle where the block's text appears
        text_area += abs(r)
    text_perc = text_area / page_area
    print("Text area proportion: " + str(text_perc))

    if text_perc < 0.01:  # No text = scanned
        result = "Scanned"
    elif img_perc > 0.8:  # Has text but very large images = searchable
        result = "Searchable text"
    else:
        result = "Digitally created"
    return result


pdffilepath = "input.pdf"  # Placeholder: replace with the path to your PDF
doc = fitz.open(pdffilepath)
for page in doc:  # Iterate through pages to find different types
    print(page_type(page))
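Since a single document can mix page types, it may help to roll the per-page labels up into one summary per document. The following is only a sketch: summarize_page_types is a hypothetical helper, and the labels are assumed to be the strings returned by page_type() above.

```python
from collections import Counter


def summarize_page_types(labels):
    """Count each label and report whether the document mixes page types."""
    counts = Counter(labels)
    return {"counts": dict(counts), "mixed": len(counts) > 1}


# Example labels, as page_type() would return them for a three-page document
labels = ["Digitally created", "Scanned", "Scanned"]
summary = summarize_page_types(labels)
print(summary)
```

In practice the list could be built with [page_type(page) for page in doc].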