How to check if PDF is scanned image or contains text

Try OCRmyPDF. You can use this command to convert a scanned pdf to digital pdf.

ocrmypdf input_scanned.pdf output_digital.pdf

If the input pdf is digital the command will throw an error "PriorOcrFoundError: page already has text!".

import subprocess as sp
import re

output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if not re.search("PriorOcrFoundError: page already has text!",output):
   print("Uploaded scanned pdf")
else:
   print("Uploaded digital pdf")

The below code will work, to extract data text data from both searchable and non-searchable PDF's.

import fitz

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    text += page.getText()

If you don't have fitz module you need to do this:

pip install --upgrade pymupdf

def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print(f"Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable

Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.

You need to install fitz and PyMuPDF modules. You can do it by means of pip.

The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. Moreover, it is important to install fitz BEFORE PyMuPDF, otherwise it provides some weird error about a missing frontend module (no idea why). So here is how I install the modules:

pip3 install fitz
pip3 install PyMuPDF==1.16.14

And here is the Python 3 implementation:

import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of document that is covered by (searchable) text.

    If the returned percentage of text is very low, the document is
    most likely a scanned PDF
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page_num, page in enumerate(doc):
        total_page_area = total_page_area + abs(page.rect)
        text_area = 0.0
        for b in page.getTextBlocks():
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area = total_text_area + text_area
    doc.close()
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")

Although this answers your question (i.e. distinguish between fully scanned and full/partial textual PDFs), this solution is not able to distinguish between full-textual PDFs and scanned PDFs that also have text within them (e.g. this is the case for scanned PDFs processed by OCR sofware - such as pdfsandwich or Adobe Acrobat - that adds "invisible" text blocks on top of the image, so that you can select the text).

How to check if PDF is scanned image or contains text

Tags:

Python

Python 3.X

Pdfminer

Pypdf2

Pdf Extraction

Related

Recent Posts