How to extract the title of a PDF document from within a script for renaming?
Installing the package
The Python standard library cannot read PDF metadata, so you will need an external package such as pdfrw, which can. Installation is easy with pip, the standard Python package manager.
On Windows, first make sure you have a recent version of pip using the shell command:
python -m pip install -U pip
On Linux:
pip install -U pip
On both platforms, then install the pdfrw package using:
pip install pdfrw
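To confirm that the installation worked before running the renaming script, a quick check like the following can help. This is only a minimal sketch: is_installed is a hypothetical helper name of my own, not something provided by pip or pdfrw.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    # find_spec returns None when no such top-level module is found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("pdfrw installed:", is_installed("pdfrw"))
```

If this prints False, the pip command most likely installed the package for a different Python interpreter than the one running the script.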
The code
I combined the approaches of zeebonk and user2125722 to write something compact and readable that stays close to your original code:
import os
from pdfrw import PdfReader

path = r'C:\Users\YANN\Desktop'

def renameFileToPDFTitle(path, fileName):
    fullName = os.path.join(path, fileName)
    # Extract pdf title from pdf file (Info or Title may be missing)
    info = PdfReader(fullName).Info
    # Remove surrounding brackets that some pdf titles have
    title = (info.Title or '').strip('()') if info else ''
    if not title:
        return  # No usable title: keep the current name
    newFullName = os.path.join(path, title + '.pdf')
    os.rename(fullName, newFullName)

for fileName in os.listdir(path):
    # Rename only pdf files
    fullName = os.path.join(path, fileName)
    if not os.path.isfile(fullName) or not fileName.lower().endswith('.pdf'):
        continue
    renameFileToPDFTitle(path, fileName)
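One thing the script above does not handle: a PDF title can contain characters that are not valid in file names (on Windows, for instance, a colon or a slash), in which case os.rename fails. A possible way to clean the title first is sketched below. The names safe_file_name and the "untitled" fallback are illustrative choices of my own, not part of pdfrw.

```python
import re

# Characters that Windows rejects in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_file_name(title, fallback="untitled"):
    """Turn a PDF title into a string usable as a file name (no extension)."""
    # Replace forbidden characters with underscores.
    cleaned = _INVALID_CHARS.sub("_", title or "")
    # Collapse runs of whitespace, then drop leading/trailing spaces and dots,
    # which Windows does not accept at the end of a name.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned or fallback
```

The result can then be used in place of the raw title when building the new file name, for example safe_file_name(title) + '.pdf'.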
You can use the pdfminer library to parse the PDFs. The info property of the parsed document contains the title of the PDF. Here is what a sample info looks like:
[{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]
Since info is a list of dictionaries, the title can be read by its key. Here is the whole code (including iterating over all the files and renaming them):
import os

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# The files are named a0001.pdf ... a3621.pdf
for i in range(1, 3622):
    file_name = "a" + str(i).zfill(4) + ".pdf"
    with open(file_name, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        metadata = doc.info  # The "Info" metadata (a list of dictionaries)
    print(metadata)
    if not metadata:
        continue  # No Info dictionary: keep the current name
    title = metadata[0].get("Title")
    if isinstance(title, bytes):
        # Some pdfminer versions return bytes; UTF-8 is an assumption here
        title = title.decode("utf-8", "replace")
    if title:
        os.rename(file_name, title + ".pdf")
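Depending on the pdfminer version, the values in info may come back as bytes rather than str. PDF text strings are stored either as UTF-16BE with a leading byte order mark or in PDFDocEncoding, so a best-effort decoder could look like the sketch below. decode_pdf_text is a hypothetical helper name, and using Latin-1 as a stand-in for PDFDocEncoding is a simplifying assumption that will not be exact for every character.

```python
import codecs


def decode_pdf_text(value):
    """Best-effort conversion of a PDF metadata value to str."""
    if isinstance(value, str):
        return value
    if value.startswith(codecs.BOM_UTF16_BE):
        # A leading FE FF marks the string as UTF-16BE; skip the marker itself.
        return value[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", "replace")
    # Assumption: treat everything else as Latin-1, which approximates
    # PDFDocEncoding for common characters.
    return value.decode("latin-1")
```

With this in place, the title lookup would become something like decode_pdf_text(metadata[0]["Title"]) before building the new file name.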
What you need is a library that can actually read PDF files, for example pdfrw:
In [8]: from pdfrw import PdfReader
In [9]: reader = PdfReader('example.pdf')
In [10]: reader.Info.Title
Out[10]: 'Example PDF document'