Extracting text from garbled PDF
I had the same problem. Uploading it to Google Drive, opening with Google Docs and copying the text from there worked for me.
I asked a lot of people for help, and OCR turned out to be the only solution to this problem.
Some PDF files are produced without special information that is crucial for successfully extracting text from them, even with the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.
Such files will display and print just fine (because the shapes of the characters are properly defined), but text can't be properly copied or extracted from them (because there is no information about which characters the glyphs/shapes represent).
For example, Distiller produces such files when the "Smallest File Size" preset is used.
I'm afraid there is no way other than OCR to retrieve text from such files. We recently published a guide on how to OCR PDFs in .NET.
We also have sample code that shows how to perform OCR for unmapped characters and then replace them with the correct Unicode values.
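If you want to try the OCR route yourself, here is a minimal sketch in Python (this is not the vendor sample mentioned above). It assumes the pdf2image and pytesseract packages are installed, along with the underlying poppler and tesseract binaries; the file name is just a placeholder.

# Rasterize each page with poppler (via pdf2image), then OCR it with Tesseract.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str) -> str:
    """Return the OCR'd text of every page of the PDF at `path`."""
    pages = convert_from_path(path, dpi=300)   # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

if __name__ == "__main__":
    print(ocr_pdf("garbled.pdf"))              # placeholder file name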
Supplementing the original answer
The original answer mentioned the "information about meaning of used glyphs/shapes". This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each and every font which is embedded as a subset and uses a non-standard (Custom) encoding.
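As a rough programmatic check (a sketch only, assuming the pypdf package; the file name is a placeholder), you can walk each page's font resources and see whether a /ToUnicode entry is present:

# List every font referenced by each page and whether it carries a /ToUnicode table.
from pypdf import PdfReader

def report_tounicode(path: str) -> None:
    reader = PdfReader(path)
    for page_no, page in enumerate(reader.pages, start=1):
        resources = page.get("/Resources")
        fonts = resources.get_object().get("/Font") if resources is not None else None
        if not fonts:
            continue
        for name, ref in fonts.get_object().items():
            font = ref.get_object()            # resolve indirect references
            mapped = "yes" if "/ToUnicode" in font else "no"
            print(f"page {page_no}: {name} ({font.get('/BaseFont')}) ToUnicode: {mapped}")

report_tounicode("textextract-bad1.pdf")       # placeholder file name

The pdffonts utility described next gives you the same information without writing any code.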
In order to quickly evaluate the chances for extractability of text contents, you can use the pdffonts command line utility. This prints, in tabular form, a series of items about each font used by the PDF. The presence of a /ToUnicode table is indicated by the column headed uni.
A few example outputs:
$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes yes     13  0

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes no      12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0
The good.pdf lets you extract the text contents for both fonts correctly, because both fonts have an accompanying /ToUnicode table.
For the bad1.pdf, text extraction fails for both fonts, because neither font has a /ToUnicode table. For the bad2.pdf, text extraction succeeds only for the first of the two fonts and fails for the other, because only that font has a /ToUnicode table.
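If you want to automate this check over many files, a small sketch (assuming Python and that the poppler pdffonts binary is on the PATH) can shell out to pdffonts and collect the names of fonts whose uni column reads "no":

# Run pdffonts and return the names of fonts that lack a /ToUnicode table.
import subprocess

def fonts_without_tounicode(pdf_path: str) -> list[str]:
    result = subprocess.run(["pdffonts", pdf_path],
                            capture_output=True, text=True, check=True)
    missing = []
    for row in result.stdout.splitlines()[2:]:     # skip the header and ruler lines
        cols = row.split()
        if not cols:
            continue
        # Counting from the right sidesteps multi-word "type" values:
        # ... emb sub uni <object number> <generation>
        if cols[-3] == "no":
            missing.append(cols[0])                # font name is the first column
    return missing

print(fonts_without_tounicode("handcoded/textextract/textextract-bad2.pdf"))

For the bad2.pdf shown above, this would report only CAAAAA+Helvetica-Bold.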
I, Kurt Pfeifle, have recently created a series of hand-coded PDF files to demonstrate the influence of existing, buggy, manipulated or missing /ToUnicode tables in the PDF source code. These PDFs are extensively commented and suitable to be explored with the help of a text editor. The pdffonts output examples above were created with the help of these hand-coded files. (There are a few more PDFs showing different results, which an interested reader may want to explore...)