Merge PDF files

Merge all pdf files that are present in a dir

Put the pdf files in a dir. Launch the program. You get one pdf with all the pdfs merged.

import os
from PyPDF2 import PdfFileMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfFileMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)

How would I make the same code above today

from glob import glob
from PyPDF2 import PdfFileMerger



def pdf_merge():
    ''' Merges all the pdf files in current directory '''
    merger = PdfFileMerger()
    allpdfs = [a for a in glob("*.pdf")]
    [merger.append(pdf) for pdf in allpdfs]
    with open("Merged_pdfs.pdf", "wb") as new_file:
        merger.write(new_file)


if __name__ == "__main__":
    pdf_merge()

You can use PyPdf2s PdfMerger class.

File Concatenation

You can simply concatenate files by using the append method.

from PyPDF2 import PdfMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()

You can pass file handles instead file paths if you want.

File Merging

If you want more fine grained control of merging there is a merge method of the PdfMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.

e.g.

merger.merge(2, pdf)

Here we insert the whole pdf into the output but at page 2.

Page Ranges

If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).

e.g.

merger.append(pdf, pages=(0, 3))    # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1,3, 5

If you specify an invalid range you will get an IndexError.

Note: also that to avoid files being left open, the PdfFileMergers close method should be called when the merged file has been written. This ensures all files are closed (input and output) in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager, so we can use the with keyword, avoid the explicit close call and get some easy exception safety.

You might also want to look at the pdfcat script provided as part of pypdf2. You can potentially avoid the need to write code altogether.

The PyPdf2 github also includes some example code demonstrating merging.

PyMuPdf

Another library perhaps worth a look is PyMuPdf. Merging is equally simple.

From command line:

python -m fitz join -o result.pdf file1.pdf file2.pdf file3.pdf

and from code

import fitz

result = fitz.open()

for pdf in ['file1.pdf', 'file2.pdf', 'file3.pdf']:
    with fitz.open(pdf) as mfile:
        result.insertPDF(mfile)
    
result.save("result.pdf")

With plenty of options, detailed in the projects wiki.


Use Pypdf or its successor PyPDF2:

A Pure-Python library built as a PDF toolkit. It is capable of:

  • splitting documents page by page,
  • merging documents page by page,

(and much more)

Here's a sample program that works with both versions.

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()
        output_stream.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    pdf_cat(sys.argv[1:], sys.stdout)