Need good OCR for printed source code listing, any ideas?
Try http://www.free-ocr.com/. I have used it to recover source code from a screen grab when my IDE crashes in an editor session without warning. It obviously depends on the font you are using in the editor (I use Courier New 10pt in Delphi). I tried to use Google Docs, which will OCR an image when you upload it - while Google Docs is pretty good on scanned documents, it fails miserably on Pascal source for some reason.
An example of FreeOCR at work. The input image gave this:
begin
FileIDToDelete := FolderToClean + 5earchRecord.Name ;
Inc (TotalFilesFound) ;
if (DeleteFile (PChar (FileIDToDelete))) then
begin
Log5tartupError (FormatEx (‘%s file %s deleted‘, [Annotation, Fi eIDToDelete])) ;
Inc (TotalFilesDeleted) ;
end
else
begin
Log5tartupError (FormatEx (‘Error deleting %s file %s‘, [Annotat'on, FileIDToDelete])) ;
Inc (TotalFilesDeleteErrors) ;
end ;
end ;
FindResult := 5ysUtils.FindNext (5earchRecord) ;
end ;
so restoring the indentation is the bulk of the work, then changing all the 5's to upper case S. It also got confused by the vertical line at the 80 column mark. Luckily most errors will be picked up by the compiler (with the exception of mistakes inside quoted strings).
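The 5-to-S cleanup can be partly automated. Below is a minimal Python sketch, not a robust tool: `fix_fives` is a hypothetical helper name, and it assumes that a 5 touching a letter is always a misread S (so a genuine identifier such as `Item5` would be damaged). It leaves quoted strings alone, since the compiler will not catch a bad guess there.

```python
import re

# A '5' with a letter directly before or after it, i.e. a '5' inside an identifier.
_FIVE_IN_IDENTIFIER = re.compile(r"(?<=[A-Za-z])5|5(?=[A-Za-z])")

# A quoted string, using straight or OCR-mangled curly quotes as delimiters.
_QUOTED = re.compile(r"([‘'’][^‘'’]*[‘'’])")


def fix_fives(line: str) -> str:
    """Replace a misread '5' with 'S' in code, leaving quoted strings untouched."""
    parts = _QUOTED.split(line)
    # With one capture group, split() alternates: even indices are code,
    # odd indices are the quoted strings that were matched.
    for i in range(0, len(parts), 2):
        parts[i] = _FIVE_IN_IDENTIFIER.sub("S", parts[i])
    return "".join(parts)


if __name__ == "__main__":
    print(fix_fives("FindResult := 5ysUtils.FindNext (5earchRecord) ;"))
```

A standalone number such as `Inc (Count, 5)` is not changed, because that 5 has no letter next to it. The result still needs a read-through and a compile.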
It's a shame FreeOCR doesn't have a "source code" option, where white space is treated as significant.
A tip: If your source includes syntax highlighting, make sure you save the image as grayscale before uploading.
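To illustrate what the grayscale step does to coloured syntax highlighting, here is a small Python sketch that maps RGB pixels to a single luminance value. The function names are hypothetical, and the weights are the common ITU-R BT.601 luma coefficients, which is an assumption about what an image editor does, not something FreeOCR documents.

```python
def to_gray(rgb):
    """Map an (R, G, B) pixel to one 0-255 luminance value (BT.601 weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_grayscale(pixels):
    """Convert a 2-D list of (R, G, B) pixels to a 2-D list of gray values."""
    return [[to_gray(p) for p in row] for row in pixels]


if __name__ == "__main__":
    # A pure-blue keyword pixel next to a white background pixel.
    print(to_grayscale([[(0, 0, 255), (255, 255, 255)]]))
```

Blue keywords end up as dark gray on a white background, so every token is separated from the background by brightness alone. In practice you would let an image editor or an imaging library do this (for example, Pillow's `Image.convert("L")`) and not loop over pixels by hand.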
Two new options exist today (years after the question was asked):
1.)
Windows 10 comes with an OCR engine from Microsoft.
It is in the namespace:
Windows.Media.Ocr.OcrEngine
https://msdn.microsoft.com/en-us/library/windows/apps/windows.media.ocr
There is also an example on Github:
https://github.com/Microsoft/Windows-universal-samples/tree/master/Samples/OCR
You need VS2015 to compile this. If you want to use an older version of Visual Studio you must invoke it via traditional COM; in that case read this article on CodeProject: http://www.codeproject.com/Articles/262151/Visual-Cplusplus-and-WinRT-Metro-Some-fundamentals
The OCR quality is very good. Nevertheless, if the text is too small you must enlarge the image first. You can download additional languages via Windows Update - even for handwriting!
2.)
Another option is to use the OCR library from Office. It is a COM DLL. It is available in Office 2003 and 2007, but was removed in Office 2010.
http://www.codeproject.com/Articles/10130/OCR-with-Microsoft-Office
The disadvantage is that every Office installation comes with support for only a few languages. For example, a Spanish Office installs support for Spanish, English, Portuguese and French. But I noticed that it makes almost no difference whether you use Spanish or English as the OCR language to detect a Spanish text.
If you convert the image to greyscale you get better results. The recognition is OK, but it did not satisfy me. It makes approximately as many errors as Tesseract, although Tesseract needs much more image preprocessing to get these results.
With OCR, there are currently three options:
- ABBYY FineReader and OmniPage. Both are commercial products which are about on par when it comes to features and OCR results. I can't say much about OmniPage but FineReader does come with support for reading source code (for example, it has a Java language library).
- The best OSS OCR engine is Tesseract. It's much harder to use, and you'll probably need to train it for your language.
I rarely do OCR but I've found that the $150 for the commercial software is far outweighed by the time it saves.
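If you do want to try the Tesseract route on a code listing, a minimal Python sketch of driving the command-line tool looks like this. It assumes a `tesseract` binary is on your PATH; the helper names are hypothetical. `--psm 6` tells Tesseract to treat the page as a single uniform block of text, and `preserve_interword_spaces=1` is a Tesseract config variable that may help keep runs of spaces. I have not verified how well that recovers indentation on a real listing.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build the argument list for a Tesseract run that prints text to stdout."""
    return [
        "tesseract", image_path, "stdout",   # "stdout" as output base prints the text
        "-l", lang,                          # language model to use
        "--psm", "6",                        # assume one uniform block of text
        "-c", "preserve_interword_spaces=1", # try to keep runs of spaces
    ]


def ocr_listing(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on the image and return the recognised text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `ocr_listing("listing.png")` (a placeholder file name) returns the recognised text as a string, which you can then clean up and feed to the compiler.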