Apple - OCR on PDFs in OS X with free, open source tools
Tesseract 3.03+ has built-in support for PDF output, which requires leptonica to be installed. You can use:
brew install tesseract --HEAD
to get the latest version of tesseract. You will also need ghostscript installed, but there is no need for hocr2pdf.
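Before running anything, it can help to confirm that both tools are actually on the PATH. This is a minimal sketch, assuming a POSIX shell; the helper name `check_deps` is made up for this example and is not part of tesseract or ghostscript:

```shell
#!/bin/sh
# check_deps is a hypothetical helper: it prints each command that is
# not found on the PATH and returns non-zero if any were missing.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return $missing
}

# The two tools the OCR script relies on
check_deps tesseract gs || echo "install the missing tools before continuing"
```

If both are found, `tesseract --version` shows which release is installed, which is worth checking against the 3.03 minimum.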
The following script uses ghostscript to split the PDF into JPEGs, tesseract to OCR the JPEGs and output single PDF pages, and finally ghostscript again to combine the pages back into one PDF.
#!/bin/sh
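# resolve the input to an absolute path, since the script changes directory below
case "$1" in
/*) y="$1" ;;
*) y="`pwd`/$1" ;;
esac
echo "Will create a searchable PDF for $y"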
x=`basename "$y"`
name=${x%.*}
# work in a scratch directory; stop if it cannot be created or entered
mkdir "$name" || exit 1
cd "$name" || exit 1
# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"
# process each page
for f in *.jpg; do
# extract text (Tesseract 4 and later spell this option --psm instead of -psm)
tesseract -l eng -psm 3 "$f" "${f%.*}" pdf
rm "$f"
done
# combine all pages back to a single file
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sDEVICE=pdfwrite -sOutputFile="../${name}_searchable.pdf" *.pdf
cd ..
rm -rf "${name}"
# Adapted from: http://www.morethantechnical.com/2013/11/21/creating-a-searchable-pdf-with-opensource-tools-ghostscript-hocr2pdf-and-tesseract-ocr/
# from http://www.ehow.com/how_6874571_merge-pdf-files-ghostscript.html
# bash tut: http://linuxconfig.org/bash-scripting-tutorial
# Linux PDF,OCR: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
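As a usage sketch, assume the script above is saved as `ocr_pdf.sh` (a file name chosen here purely for illustration) and made executable. The output name it produces can be predicted with the same parameter expansion the script uses; the function name `searchable_name` is likewise hypothetical:

```shell
#!/bin/sh
# searchable_name is a hypothetical helper that mirrors the script's naming:
# strip the directory, drop the last extension, append _searchable.pdf.
searchable_name() {
    x=`basename "$1"`
    name=${x%.*}
    echo "${name}_searchable.pdf"
}

searchable_name "scans/invoice 2014.pdf"   # prints: invoice 2014_searchable.pdf

# Typical invocation (commented out, since it needs tesseract and gs installed):
# chmod +x ocr_pdf.sh
# ./ocr_pdf.sh "scans/invoice 2014.pdf"
```

Note that the script writes the result into the directory it was run from, not next to the input file, because the temporary folder is created in the current directory and the output path is `../` relative to it.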