What's the best, simplest OCR solution?
GOCR is an OCR (Optical Character Recognition) program. It converts scanned images of text back to text files.
CLARA is another good graphical option.
OCRAD is an OCR program that can be used as a stand-alone console application or as a backend to other programs.
KOOKA is a KDE application but works fine. In addition, you have to install actual OCR programs like GOCR and OCRAD. After installing Kooka and the OCR programs, you have to point Kooka to the OCR install location so that it can convert the JPEG to text.
OCRFeeder is a document layout analysis and optical character recognition system.
Tesseract is a command-line utility and it is very simple to use. You can install the language package tesseract-ocr-eng for English.
Have a look at this page.
Note:
To run tesseract, open a terminal and type the following (tesseract adds the .txt extension to the output name itself):
tesseract imagefile.tif outputfile
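Older versions of Tesseract can only read a TIFF file: if you've got a JPEG or PDF or whatever, you'll have to convert it, and the filename extension must be .tif, not .tiff, otherwise tesseract errors out. Newer versions (3.x and later, built on Leptonica) also read PNG and JPEG directly, but a PDF still has to be converted to images first.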
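If your Tesseract build does insist on a .tif input, the conversion step could look roughly like the sketch below. It assumes ImageMagick's `convert` is installed (on ImageMagick 7 the command is `magick`); the function names `tif_name` and `ocr_image` are made up for this example.

```shell
# Derive the .tif file name from any input image name,
# e.g. scan.jpeg -> scan.tif (hypothetical helper).
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert an image to TIFF if needed, then OCR it.
# Assumes ImageMagick's `convert` and `tesseract` are on the PATH.
ocr_image() {
    src=$1
    tif=$(tif_name "$src")
    # Convert only when the input is not already the .tif file.
    if [ "$src" != "$tif" ]; then
        convert "$src" "$tif" || return 1
    fi
    # tesseract appends .txt to the output base name itself.
    tesseract "$tif" "${tif%.tif}"
}
```

Calling `ocr_image scan.jpeg` would then write scan.tif and scan.txt next to the original file.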
There are a few popular OCR command-line tools you can use (I'm not sure whether they have a GUI):
Tesseract (ReadMe, FAQ) (Python)
Also available for: Tesseract .NET, Tesseract iOS
An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. Tesseract is probably the most accurate open source OCR engine available.
Usage:
tesseract [inputFile] [outputFile] [-l optionalLanguageFile] [PathTohOCRConfigFile]
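As a concrete instance of the usage line above, here is a small wrapper sketch. The function name `ocr` and the `DRY_RUN` variable are inventions for this example; only `tesseract` and its `-l` option come from Tesseract itself.

```shell
# OCR one image with an optional language code (defaults to eng).
# The output base is the input name without its extension;
# tesseract adds .txt on its own.
ocr() {
    input=$1
    lang=${2:-eng}
    base=${input%.*}
    # With DRY_RUN=1 the command is only printed, not executed.
    ${DRY_RUN:+echo} tesseract "$input" "$base" -l "$lang"
}
```

For example, `ocr page.tif deu` would run German recognition on page.tif and write page.txt, provided the German language data is installed.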
GOCR
Open-source character recognition. It converts scanned images of text back to text files. GOCR can be used with different front-ends, which makes it very easy to port to different OSes and architectures. It can open many different image formats, and its quality has been improving on a daily basis.
OCRopus™ (FAQ) (written in Python, NumPy, and SciPy)
OCR system focusing on the use of large scale machine learning for addressing problems in document analysis, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.
OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.
Tessnet2 (Open source, OCR, Tesseract, .NET, DOTNET, C#, VB.NET, C++/CLI)
Tesseract is a C++ open source OCR engine. Tessnet2 is a .NET assembly that exposes very simple methods to do OCR. Tessnet2 is under the Apache 2 license (like Tesseract), meaning you can use it however you want, including in commercial products.
A few others: ABBYY CLI OCR for Linux, Asprise OCR
For a more complete list, check: List of optical character recognition software at Wikipedia
See also: wanghaisheng/awesome-ocr, a curated list of promising OCR resources at GitHub.
Gscan2PDF
OCR on multi page PDF or scanned documents
This is probably the easiest way. Gscan2pdf is a graphical tool which lets you not only scan files, but also import files and perform OCR on them. Install gscan2pdf from the Ubuntu Software Center or by running this command in a terminal:
sudo apt-get install gscan2pdf
- Run gscan2pdf
- Import the pdf (Ctrl+O)
- Optional: Tools > Clean up
- Choose Tools > OCR
- Save (Ctrl+S)
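If you would rather script the same multi-page job than click through it, a rough command-line equivalent is sketched below. This is not what gscan2pdf does internally; it swaps in `pdftoppm` (from poppler-utils) to split the PDF and calls `tesseract` on each page. Both tools are assumed to be installed, and the function name `pdf_to_text` is made up for this example.

```shell
# Split a PDF into one 300 dpi TIFF per page, OCR each page,
# and join the per-page results into <name>.txt.
# Assumes pdftoppm (poppler-utils) and tesseract are on the PATH.
pdf_to_text() {
    pdf=$1
    lang=${2:-eng}
    name=${pdf%.*}
    # Writes <name>-page-1.tif, <name>-page-2.tif, ... (zero-padded
    # when there are ten or more pages, so the glob stays in order).
    pdftoppm -r 300 -tiff "$pdf" "${name}-page" || return 1
    for page in "${name}"-page-*.tif; do
        tesseract "$page" "${page%.tif}" -l "$lang" || return 1
    done
    cat "${name}"-page-*.txt > "${name}.txt"
}
```

Running `pdf_to_text report.pdf` would leave the page images, the per-page text files and a combined report.txt in the current directory.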
Gscan2PDF can use customizable OCR engines; the default is tesseract-ocr.
You might consider selecting the appropriate language. In that case you will need to install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. Right now there are 108 languages in the 16.04 repository.
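To make the package naming concrete, here is a small sketch that turns language codes into package names and installs them. The helpers `lang_pkg` and `install_langs` and the `DRY_RUN` variable are made up for this example; which codes actually exist depends on your release's repository.

```shell
# Build the package name for a language code,
# e.g. deu -> tesseract-ocr-deu (hypothetical helper).
lang_pkg() {
    printf 'tesseract-ocr-%s\n' "$1"
}

# Install one or more language packs.
# With DRY_RUN=1 the apt-get command is only printed.
install_langs() {
    pkgs=
    for code in "$@"; do
        pkgs="$pkgs $(lang_pkg "$code")"
    done
    # $pkgs is left unquoted on purpose so that each package
    # becomes a separate argument.
    ${DRY_RUN:+echo} sudo apt-get install $pkgs
}
```

For instance, `install_langs deu fra` would install the German and French data; afterwards `tesseract --list-langs` should show which languages Tesseract can see.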
- Source