How can I extract text from images?
Extracting text from images is called OCR (optical character recognition), and Ubuntu has a wiki page dedicated to it. From that page:
Available OCR tools
The Ubuntu Universe repositories contain the following OCR tools:
- gocr - A command line OCR
- fuzzyocr - spamassassin plugin to check image attachments
- libhocr0 - Hebrew OCR
- ocrad - Optical Character Recognition program
- ocrfeeder - Document layout analysis and optical character recognition system
- ocropus - document analysis and OCR system
- tesseract-ocr
The Ubuntu multiverse repositories also contain:
- cuneiform - multi-language OCR system
Some of these packages are outdated, but unofficial, more recent builds can be found in the Alex_P PPA (PPA adding code: ppa:alex-p/notesalexp). If you have never used a PPA, check how to add software from a PPA.
Edit: As pointed out in a comment, Clara OCR exists too, but its package got stuck at Hardy and its website was last updated in 2009.
tesseract-ocr is arguably the best of all these tools.
To install it, run the command below:
sudo apt-get install tesseract-ocr
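Usage is tesseract filename.jpg output, which generates the file output.txt. tesseract appends the .txt extension to the output name itself, so passing output.txt as the second argument would produce output.txt.txt.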
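To run the same command over a whole directory of images, a small shell loop is enough. The following is a minimal sketch, assuming the images are .jpg files in the current directory; the helper name output_base is made up for illustration. It strips the image extension so that tesseract, which adds .txt on its own, writes scan.txt next to scan.jpg.

```shell
#!/bin/sh
# Hypothetical helper: derive the output base from an image file name
# by dropping the last extension (scan.jpg -> scan).
output_base() {
    printf '%s\n' "${1%.*}"
}

# Only attempt OCR when tesseract is actually installed.
if command -v tesseract >/dev/null 2>&1; then
    for img in ./*.jpg; do
        [ -e "$img" ] || continue   # skip when the glob matched nothing
        # tesseract appends .txt to the output base itself.
        tesseract "$img" "$(output_base "$img")"
    done
fi
```

The loop does nothing if tesseract is missing or no .jpg files are present, so it should be safe to try in any directory.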
You might consider selecting the appropriate language. In that case, you will need to install the tesseract-ocr-LANG package, where LANG is the three-letter ISO 639-2 language code. The 18.04 repositories currently offer 123 languages. Then use, for example:
tesseract mySpanishText.jpg output -l spa
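Before picking a value for -l, it can help to check which language data is actually installed. This is a minimal sketch, assuming a tesseract version that supports the --list-langs option and the + syntax for combining languages; join_langs is a hypothetical helper and mixed.jpg is a placeholder file name.

```shell
#!/bin/sh
# Hypothetical helper: join language codes with '+' for the -l option
# (spa eng -> spa+eng).
join_langs() {
    old_ifs=$IFS
    IFS='+'
    printf '%s\n' "$*"   # "$*" joins the arguments with the first IFS character
    IFS=$old_ifs
}

if command -v tesseract >/dev/null 2>&1; then
    # Print the language codes tesseract can currently use.
    tesseract --list-langs
    # OCR a page that mixes Spanish and English text (placeholder file).
    if [ -e mixed.jpg ]; then
        tesseract mixed.jpg output -l "$(join_langs spa eng)"
    fi
fi
```

If a code such as spa is missing from the list, installing the matching tesseract-ocr-spa package should make it available.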