How to extract the title of a PDF document from within a script for renaming?

Installing the package

This cannot be solved with plain Python. You will need an external package such as pdfrw, which allows you to read PDF metadata. The installation is quite easy using the standard Python package manager pip.

On Windows, first make sure you have a recent version of pip using the shell command:

python -m pip install -U pip

On Linux:

pip install -U pip

On both platforms, install then the pdfrw package using

pip install pdfrw

The code

I combined the ansatzes of zeebonk and user2125722 to write something very compact and readable which is close to your original code:

import os
from pdfrw import PdfReader

path = r'C:\Users\YANN\Desktop'


def renameFileToPDFTitle(path, fileName):
    fullName = os.path.join(path, fileName)
    # Extract pdf title from pdf file
    newName = PdfReader(fullName).Info.Title
    # Remove surrounding brackets that some pdf titles have
    newName = newName.strip('()') + '.pdf'
    newFullName = os.path.join(path, newName)
    os.rename(fullName, newFullName)


for fileName in os.listdir(path):
    # Rename only pdf files
    fullName = os.path.join(path, fileName)
    if (not os.path.isfile(fullName) or fileName[-4:] != '.pdf'):
        continue
    renameFileToPDFTitle(path, fileName)

You can use pdfminer library to parse the PDFs. The info property contains the Title of the PDF. Here is what a sample info looks like :

[{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer `V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]`

Then we can extract the Title using the properties of a dictionary. Here is the whole code (including iterating all the files and renaming them):

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import os

start = "0000"

def convert(var):
    while len(var) < 4:
        var = "0" + var

    return var

for i in range(1,3622):
    var = str(i)
    var = convert(var)
    file_name = "a" + var + ".pdf"
    fp = open(file_name, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fp.close()
    metadata = doc.info  # The "Info" metadata
    print metadata
    metadata = metadata[0]
    for x in metadata:
        if x == "Title":
            new_name = metadata[x] + ".pdf"
            os.rename(file_name,new_name)

What you need is a library that can actually read PDF files. For example pdfrw:

In [8]: from pdfrw import PdfReader

In [9]: reader = PdfReader('example.pdf')

In [10]: reader.Info.Title
Out[10]: 'Example PDF document'

How to extract the title of a PDF document from within a script for renaming?

Installing the package

The code

Tags:

Python

Pdf

File

Python 3.X

Related

Recent Posts