How to extract text and text coordinates from a PDF file?
This is the minimal working solution that I found. Newlines are converted to underscores in the final output.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
def parse_obj(lt_objs):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            # print() with a single argument works on both Python 2 and 3
            print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
Here's a copy-and-paste-ready example that lists the top-left corners of every block of text in a PDF, and which I think should work for any PDF that doesn't include "Form XObjects" that have text in them:
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
fp = open('yourpdf.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)
for page in pages:
    print('Processing next page...')
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))
The code above is based upon the Performing Layout Analysis example in the PDFMiner docs, plus the examples by pnj (https://stackoverflow.com/a/22898159/1709587) and Matt Swain (https://stackoverflow.com/a/25262470/1709587). There are a couple of changes I've made from these previous examples:
- I use PDFPage.get_pages(), which is a shorthand for creating a document, checking it is_extractable, and passing it to PDFPage.create_pages().
- I don't bother handling LTFigures, since PDFMiner is currently incapable of cleanly handling text inside them anyway.
LAParams lets you set some parameters that control how individual characters in the PDF get magically grouped into lines and textboxes by PDFMiner. If you're surprised that such grouping is a thing that needs to happen at all, it's justified in the pdf2txt docs:
In an actual PDF file, text portions might be split into several chunks in the middle of its running, depending on the authoring software. Therefore, text extraction needs to splice text chunks.
LAParams's parameters are, like most of PDFMiner, undocumented, but you can see them in the source code or by calling help(LAParams) at your Python shell. The meaning of some of the parameters is given at https://pdfminer-docs.readthedocs.io/pdfminer_index.html#pdf2txt-py since they can also be passed as arguments to pdf2txt at the command line.
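As a rough sketch of what tuning these parameters can look like, the snippet below collects the keyword arguments I believe LAParams accepts in PDFMiner 20181108, with what I understand to be their defaults. The helper names layout_settings and make_laparams are hypothetical, and the defaults and one-line descriptions are my reading of the source and the pdf2txt docs, so check help(LAParams) for your installed version before relying on them.

```python
# Assumed LAParams keyword arguments and defaults (verify with help(LAParams)).
DEFAULT_LAYOUT_SETTINGS = {
    "line_overlap": 0.5,       # how much two characters must overlap vertically to share a line
    "char_margin": 2.0,        # characters closer than this are joined into one line
    "line_margin": 0.5,        # lines closer than this are joined into one text box
    "word_margin": 0.1,        # gap above which a space is inserted between characters
    "boxes_flow": 0.5,         # weighting of horizontal vs. vertical position when ordering boxes
    "detect_vertical": False,  # whether to look for vertically written text
    "all_texts": False,        # pdf2txt's -A option
}


def layout_settings(**overrides):
    """Return the assumed defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULT_LAYOUT_SETTINGS)
    if unknown:
        raise TypeError("unknown layout parameter(s): %s" % ", ".join(sorted(unknown)))
    settings = dict(DEFAULT_LAYOUT_SETTINGS)
    settings.update(overrides)
    return settings


def make_laparams(**overrides):
    """Build an LAParams object (hypothetical helper; needs PDFMiner installed)."""
    from pdfminer.layout import LAParams  # imported lazily so the rest runs without PDFMiner
    return LAParams(**layout_settings(**overrides))
```

For example, laparams = make_laparams(char_margin=1.0, line_margin=0.3) should make the grouping into lines and boxes stricter, which can help when unrelated columns of text are being merged into one box.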
The layout object above is an LTPage, which is an iterable of "layout objects". Each of these layout objects can be one of the following types...
LTTextBox
LTFigure
LTImage
LTLine
LTRect
... or their subclasses. (In particular, your textboxes will probably all be LTTextBoxHorizontals.)
More detail of the structure of an LTPage is shown by a diagram in the PDFMiner docs.
Each of the types above has a .bbox property that holds a (x0, y0, x1, y1) tuple containing the coordinates of the left, bottom, right, and top of the object respectively. The y-coordinates are given as the distance from the bottom of the page. If it's more convenient for you to work with the y-axis going from top to bottom instead, you can subtract them from the height of the page's .mediabox:
x0, y0_orig, x1, y1_orig = some_lobj.bbox
y0 = page.mediabox[3] - y1_orig
y1 = page.mediabox[3] - y0_orig
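The same conversion can be wrapped in a small helper. This is only a sketch: flip_bbox is a hypothetical name, and it assumes the page's mediabox starts at y = 0, so that mediabox[3] equals the page height.

```python
def flip_bbox(bbox, page_height):
    """Convert a bottom-origin (x0, y0, x1, y1) bbox to top-origin coordinates.

    In the result, y0 is the distance from the top of the page to the top
    edge of the object and y1 is the distance to its bottom edge.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


# A box whose bottom edge is 700 units up a 792-unit-tall page:
print(flip_bbox((10, 700, 110, 720), 792))  # → (10, 72, 110, 92)
```

Inside the page loop you would call it as flip_bbox(lobj.bbox, page.mediabox[3]).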
In addition to a bbox, LTTextBoxes also have a .get_text() method, shown above, that returns their text content as a string. Note that each LTTextBox is a collection of LTTextLines, which in turn contain LTChars (characters explicitly drawn by the PDF, with a bbox) and LTAnnos (extra spaces that PDFMiner adds to the string representation of the text box's content based upon the characters being drawn a long way apart; these have no bbox).
The code example at the beginning of this answer combined these two properties to show the coordinates of each block of text.
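If you need coordinates for individual characters rather than whole blocks, you can walk down from a text box to its characters. The sketch below assumes that iterating an LTTextBox yields its lines and that iterating a line yields LTChar and LTAnno objects. It is duck-typed (it checks for a bbox attribute instead of using isinstance(item, LTChar)) so that it runs without PDFMiner installed, and iter_char_boxes is a hypothetical name.

```python
def iter_char_boxes(textbox):
    """Yield (character, bbox) for every positioned character in a text box."""
    for line in textbox:       # each line of the text box
        for item in line:      # LTChar or LTAnno
            bbox = getattr(item, "bbox", None)
            if bbox is not None:   # LTAnno objects have no bbox, so they are skipped
                yield item.get_text(), bbox
```

In the page loop above, for char, bbox in iter_char_boxes(lobj): print(char, bbox) would list each drawn character of a text box together with its position.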
Finally, it's worth noting that, unlike the other Stack Overflow answers cited above, I don't bother recursing into LTFigures. Although LTFigures can contain text, PDFMiner doesn't seem capable of grouping that text into LTTextBoxes (you can try yourself on the example PDF from https://stackoverflow.com/a/27104504/1709587) and instead produces an LTFigure that directly contains LTChar objects. You could, in principle, figure out how to piece these together into a string, but PDFMiner (as of version 20181108) can't do it for you.
Hopefully, though, the PDFs you need to parse don't use Form XObjects with text in them, and so this caveat won't apply to you.
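If the caveat does apply to you, here is one naive way the loose characters inside an LTFigure could be pieced together. This is my own heuristic, not something PDFMiner provides: it groups characters whose bottom edges are within a tolerance of each other into a line, then sorts each line from left to right. The name text_from_chars and the line_tolerance default are assumptions, no spaces are inferred between words, and nested figures are not descended into.

```python
def text_from_chars(objs, line_tolerance=2.0):
    """Naively rebuild text from positioned character objects.

    Anything without both a bbox and a get_text() method (images, nested
    figures, LTAnno objects) is ignored.
    """
    chars = [o for o in objs if hasattr(o, "bbox") and hasattr(o, "get_text")]
    chars.sort(key=lambda c: -c.bbox[1])  # highest on the page first
    lines = []
    for char in chars:
        # Same line if the bottom edge is close to that of the line's first character.
        if lines and abs(lines[-1][0].bbox[1] - char.bbox[1]) <= line_tolerance:
            lines[-1].append(char)
        else:
            lines.append([char])
    return "\n".join(
        "".join(c.get_text() for c in sorted(line, key=lambda c: c.bbox[0]))
        for line in lines
    )
```

Since an LTFigure is iterable, text_from_chars(lobj) could be called in the page loop whenever lobj is an LTFigure. Expect rough results on rotated text, multi-column figures, or fonts with unusual baselines.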