How can I extract text from images?

Extracting text from images is called optical character recognition (OCR), and Ubuntu has a wiki page dedicated to it. From that page:

Available OCR tools

The Ubuntu Universe repositories contain the following OCR tools:

gocr - A command line OCR
fuzzyocr - spamassassin plugin to check image attachments
libhocr0 - Hebrew OCR
ocrad - Optical Character Recognition program
ocrfeeder - Document layout analysis and optical character recognition system
ocropus - document analysis and OCR system
tesseract-ocr

The Ubuntu Multiverse repositories also contain:

cuneiform - multi-language OCR system

Some of these packages are outdated, but newer unofficial builds can be found in the Alex_P PPA (ppa:alex-p/notesalexp). If you have never used a PPA, check how to add software from a PPA first.

Edit: As noted in a comment, Clara OCR exists too, but its package got stuck at Hardy and its website was last updated in 2009.

tesseract-ocr is the best of these compared to all the others. To install it, run the command below:

sudo apt-get install tesseract-ocr

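Usage is tesseract filename.jpg output, which writes the recognized text to output.txt. Tesseract appends the .txt extension itself, so passing output.txt as the second argument would produce output.txt.txt.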
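If you have many images, the single command above can be wrapped in a small script. The following is only a sketch: the function names output_base, ocr_image and ocr_all_jpgs are made up for illustration, and it assumes every image has a file extension and that the text file should be written next to the image.

```shell
# Hypothetical helper: strip the extension to get tesseract's output base.
# "scans/page1.jpg" -> "scans/page1"; tesseract then writes "scans/page1.txt".
# Assumes the file name contains an extension.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: OCR a single image, writing the text next to it.
ocr_image() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found; install tesseract-ocr first" >&2
        return 1
    fi
    tesseract "$1" "$(output_base "$1")"
}

# Hypothetical helper: OCR every .jpg file in the current directory.
ocr_all_jpgs() {
    for img in *.jpg; do
        # Skip the literal "*.jpg" pattern when no files match.
        [ -e "$img" ] || continue
        ocr_image "$img"
    done
}
```

After sourcing the file, running ocr_all_jpgs in a directory of scans should leave one .txt file beside each .jpg.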

You may also want to select the appropriate language for your text. In that case, you need to install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. The 18.04 repositories currently offer 123 languages. Then use, for example:

tesseract mySpanishText.jpg output -l spa
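Before running with -l, it can help to check whether the language data is actually installed. Here is a rough sketch of that check; lang_package and has_lang are hypothetical helper names, and it relies on tesseract's --list-langs option, whose output format may differ slightly between versions.

```shell
# Hypothetical helper: map a language code to its Ubuntu package name,
# following the tesseract-ocr-LANG pattern described above.
lang_package() {
    printf 'tesseract-ocr-%s\n' "$1"
}

# Hypothetical helper: succeed if tesseract reports the given language.
# Some versions print the list on stderr, hence the 2>&1.
has_lang() {
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Hypothetical helper: OCR an image in a language, or say what to install.
ocr_with_lang() {
    # $1 = image, $2 = output base, $3 = language code (for example spa)
    if has_lang "$3"; then
        tesseract "$1" "$2" -l "$3"
    else
        echo "Language '$3' not installed; try: sudo apt-get install $(lang_package "$3")" >&2
        return 1
    fi
}
```

For example, ocr_with_lang mySpanishText.jpg output spa runs the same command as above when Spanish is available, and otherwise prints the package to install.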
