Copy pdf text layer to another pdf

This answer on Stack Overflow has a solution. You can extract the text with coordinates from your pdf-2 using pdftotext -bbox or the Python package PDFMiner, write it as hidden text into a new PDF with the Python package ReportLab, and then merge this hidden-text PDF with your pdf-1 using PDFtk. (There's a GUI for Windows on the PDFtk web page; the command-line version for Unix is now called PDFtk Server.)
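As a rough illustration of the extraction step only, the sketch below turns the XHTML that pdftotext -bbox emits into tab-separated lines that a later ReportLab step could read. The function name bbox_to_tsv and the file name words.tsv are hypothetical, and the sketch assumes the Poppler build of pdftotext, which writes one <word xMin=… yMin=… xMax=… yMax=…> element per line.

```shell
# Hypothetical helper (not from the original answer): read `pdftotext -bbox`
# output on stdin and print one line per word: xMin, yMin, xMax, yMax, text.
# Assumes each <word> element sits on its own line, with attributes in that order.
bbox_to_tsv() {
    awk -F'"' '
        /<word / {
            text = $9
            sub(/^>/, "", text)           # drop the ">" that ends the opening tag
            sub(/<\/word>.*$/, "", text)  # drop the closing tag
            printf "%s\t%s\t%s\t%s\t%s\n", $2, $4, $6, $8, text
        }'
}

# Example usage (assumes pdftotext from Poppler is installed):
#   pdftotext -bbox pdf-2.pdf - | bbox_to_tsv > words.tsv
```

The coordinates are in PDF points measured from the top-left corner of the page, so a script that draws the text with ReportLab would likely need to flip the y axis using the page height.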

Or, you could try merging pdf-1 and pdf-2 directly using PDFtk. Run pdftk pdf-2 multistamp pdf-1 output out.pdf. This puts each page of pdf-1 in front of the corresponding page of pdf-2, so you will only see the images from pdf-1 (assuming they are scans and do not have a transparent background), but the hidden text from pdf-2 is still included. The downside is that the output may be very large, since it contains the page image from both inputs for every page. I have verified that this works, and the size of the output PDF was the sum of the sizes of the inputs.
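To check the size trade-off on your own files, a small helper like the one below can add up the input sizes for comparison with the output. This is only a sketch: sum_sizes is a hypothetical name, and the .pdf file names in the usage comment are placeholders.

```shell
# Hypothetical helper (not from the original answer): print the total size,
# in bytes, of all files given as arguments.
sum_sizes() {
    local total=0 size f
    for f in "$@"; do
        size="$(wc -c < "${f}")"
        total=$(( total + size ))
    done
    echo "${total}"
}

# Example usage (assumes pdftk is installed and the input files exist):
#   pdftk pdf-2.pdf multistamp pdf-1.pdf output out.pdf
#   echo "inputs: $(sum_sizes pdf-1.pdf pdf-2.pdf) bytes"
#   echo "output: $(sum_sizes out.pdf) bytes"
```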

Here's a simple shell script to do this on the command line:

Save this as ~/pdf-merge-text.sh (and chmod +x it):

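#!/usr/bin/env bash

set -eu

# Usage: pdf-merge-text.sh TEXT_PDF IMAGE_PDF [OUTPUT_PDF]
# "-" means stdin/stdout; OUTPUT_PDF defaults to "-".
pdf_merge_text() {
    local txtpdf; txtpdf="$1"
    local imgpdf; imgpdf="$2"
    local outpdf; outpdf="${3--}"
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        # Temporary text-only PDF in the current directory (--suffix needs GNU mktemp).
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        # Remove the temporary file when this subshell exits, even on failure.
        trap 'rm -f -- "${txtonlypdf}"' EXIT
        # Strip the images from the OCR'd PDF, keeping its text.
        # Ghostscript's messages go to stderr so they cannot mix into a PDF written to stdout.
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}" 1>&2
        # Stamp each page of the image PDF over the matching text-only page.
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}

pdf_merge_text "$@"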

Now just call it:

~/pdf-merge-text.sh txt.pdf img.pdf out.pdf

The idea is to strip the images from the OCR'd PDF, then merge it with the image PDF using the multistamp technique from the answer above.
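Since the script depends on both Ghostscript (gs) and pdftk being installed, a quick preflight check can save a confusing failure halfway through. The helper below is a minimal sketch; missing_tools is a hypothetical name and is not part of the script above.

```shell
# Hypothetical helper (not from the original answer): report every named
# command that is not on PATH; return non-zero if any is missing.
missing_tools() {
    local rc=0 tool
    for tool in "$@"; do
        if ! command -v "${tool}" >/dev/null 2>&1; then
            echo "missing: ${tool}" 1>&2
            rc=1
        fi
    done
    return "${rc}"
}

# Example usage before running the merge script:
#   missing_tools gs pdftk || exit 1
```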
