How to solve MemoryError using Python 3.7 pdf2image library?

I am a bit late to this, but the problem is indeed caused by all 136 pages being loaded into memory at once. There are three things you can do.

Specify a format for the converted images.

By default, pdf2image uses PPM as its image format. It is faster, but it also takes a lot more memory (over 30MB per image!). To fix this, use a more memory-friendly format such as JPEG or PNG.

from pdf2image import convert_from_path
convert_from_path(r'C:\path\to\your\pdf', fmt='jpeg')  # raw string, so '\t' is not read as a tab

That will probably solve the problem, but mostly just because of the compression. At some point (say, for a PDF of 500+ pages) the problem will reappear.
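To see why uncompressed output adds up so quickly, here is a rough back-of-the-envelope sketch. It assumes 8-bit RGB pixels (3 bytes each) and a US Letter page at 200 DPI; raw_rgb_bytes is a made-up helper for illustration, not part of pdf2image, and your actual numbers depend on the page size and DPI you use.

```python
def raw_rgb_bytes(width_in, height_in, dpi=200):
    """Approximate size in bytes of one uncompressed 8-bit RGB page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel: R, G, B


# Assumed page size: US Letter (8.5 x 11 inches)
per_page = raw_rgb_bytes(8.5, 11, dpi=200)
total = 136 * per_page
print(f"{per_page / 1e6:.1f} MB per page, {total / 1e9:.2f} GB for 136 pages")
```

Raising the DPI makes this grow quadratically, which is how a single page can end up well above 30MB.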

Use an output directory

This is the one I would recommend because it allows you to process any PDF. The example on the README page explains it well:

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)

This writes the images to a temporary directory on disk, which is removed automatically when the with block exits, so you don't have to delete anything manually. Make sure to do any processing you need before exiting the with context, though!
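A minimal sketch of "do the processing inside the with block" could look like this. The function name collect_page_sizes and its convert parameter are hypothetical (the parameter only exists so the function can be tried without Poppler installed); by default it falls back to pdf2image's convert_from_path.

```python
import tempfile


def collect_page_sizes(pdf_path, convert=None):
    """Return the (width, height) of every page, read while the temp dir still exists."""
    if convert is None:
        # Imported lazily; assumes pdf2image and Poppler are installed.
        from pdf2image import convert_from_path as convert

    with tempfile.TemporaryDirectory() as path:
        images = convert(pdf_path, output_folder=path)
        # Anything that touches the images has to happen here:
        # the files behind them are deleted when this block exits.
        sizes = [image.size for image in images]

    # Only plain tuples leave the block, so nothing depends on the deleted files.
    return sizes
```

The point is that only the results you actually need (here, a list of sizes) should outlive the with block, not the image objects themselves.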

Process the PDF file in chunks

pdf2image allows you to define the first and last page that you want to process. That means that in your case, with a PDF of 136 pages, you could do:

# Pages are 1-indexed, so the chunks are 1-10, 11-20, ..., 131-136
for i in range(0, 136 // 10 + 1):
    convert_from_path(r'C:\path\to\your\pdf', first_page=i*10 + 1, last_page=min((i+1)*10, 136))

Convert the PDF in blocks of 10 pages at a time (1-10, 11-20, and so on).

from pdf2image import pdfinfo_from_path, convert_from_path

# pdf_file is the path to your PDF
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)

maxPages = info["Pages"]
for page in range(1, maxPages + 1, 10):
    # Each call converts at most 10 pages; the last block is clamped to maxPages
    convert_from_path(pdf_file, dpi=200, first_page=page, last_page=min(page + 10 - 1, maxPages))
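If you want the chunk boundaries in one place (they are easy to get off by one), something like the following sketch may help. page_ranges and convert_in_chunks are hypothetical helper names, the page_NNNN.png naming is just an example, and out_dir is assumed to already exist.

```python
def page_ranges(total_pages, chunk_size=10):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering each page once."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_file, out_dir, chunk_size=10):
    """Convert a PDF chunk by chunk, saving each page to out_dir as a PNG."""
    # Imported lazily; assumes pdf2image and Poppler are installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total = pdfinfo_from_path(pdf_file)["Pages"]
    for first, last in page_ranges(total, chunk_size):
        images = convert_from_path(pdf_file, first_page=first, last_page=last)
        for offset, image in enumerate(images):
            image.save(f"{out_dir}/page_{first + offset:04d}.png")
        # images is rebound on the next iteration, so the previous chunk can be freed


print(list(page_ranges(25, 10)))
```

For 25 pages in chunks of 10 this gives the ranges 1-10, 11-20 and 21-25, so only one chunk of images is held in memory at a time.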

The accepted answer has a small issue.

maxPages = pdf2image._page_count(pdf_file)

can no longer be used, because _page_count is deprecated. Here is a working alternative that reads the page count with PyPDF2 instead:

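import pdf2image
from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0 API; newer releases use PdfReader and len(reader.pages)

# pdf is the path to your PDF
with open(pdf, "rb") as f:
    maxPages = PdfFileReader(f).numPages

# maxPages + 1, otherwise the last page is skipped when maxPages % 100 == 1
for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages), fmt='jpg',
                                             thread_count=1, userpw=None,
                                             use_cropbox=False, strict=False)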

This way, however large the file, only 100 pages are processed at a time, so RAM usage stays bounded.
