Can OCR software reliably read values from a table?

The short answer is yes, as long as you choose the right tools.

I don't know whether open source tools can ever get close to 100% accuracy on those images, but judging by the other answers here, probably yes, if you spend some time on training and solve the table analysis problem and similar issues.

Commercial OCR such as ABBYY will give you 99%+ accuracy out of the box and will detect tables automatically. No training, no setup, it just works. The drawback is that you have to pay for it. Some would object that with open source you pay with the time you spend setting it up and maintaining it, but everyone has to decide that for themselves.

Within commercial tools there is actually more choice, and it depends on what you want. Boxed products like FineReader are aimed at converting input documents into editable documents such as Word or Excel files. Since you want the data itself, not a Word document, you may need to look into a different product category: Data Capture, which is essentially OCR plus some additional logic to find the necessary data on the page. For an invoice that could be the company name, total amount, due date, line items in the table, etc.

Data Capture is a complicated subject and requires some learning, but used properly it can give guaranteed accuracy when capturing data from documents. It uses various rules to cross-check the data, database lookups, etc. When necessary it can send data for manual verification. Enterprises widely use Data Capture applications to enter millions of documents every month and rely heavily on the extracted data in their everyday workflow.
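
As a rough illustration of the kind of cross-check rule mentioned above, here is a small Python sketch that compares the sum of the captured line items with the captured total and flags the document for manual verification when they disagree. The function names, the tolerance and the idea of passing amounts as raw OCR strings are assumptions made for this example, not the API of any Data Capture product.

```python
from decimal import Decimal, InvalidOperation
from typing import List, Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse an OCR'd amount such as '1,200.50'; return None if it is not a number."""
    try:
        return Decimal(text.replace(",", "").strip())
    except InvalidOperation:
        return None


def needs_manual_review(line_items: List[str], total: str,
                        tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when the captured values fail the cross-check."""
    amounts = [parse_amount(item) for item in line_items]
    total_value = parse_amount(total)
    # Any unreadable field goes straight to the operator.
    if total_value is None or any(a is None for a in amounts):
        return True
    # Rule: the line items must add up to the total, within the tolerance.
    return abs(sum(amounts) - total_value) > tolerance
```

A misread character such as the letter "O" in place of a zero makes the amount unparsable, so the document is sent to an operator instead of being accepted silently.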

And of course there are also OCR SDKs, which give you API access to the recognition results so that you can program what to do with the data.

If you describe your task in more detail, I can advise you on which direction is easiest to go.

UPDATE

So what you are building is basically a Data Capture application, but not a fully automated one, using the so-called "click to index" approach. There are a number of applications like that on the market: you scan images, and an operator clicks on the text in the image (or draws a rectangle around it) to populate fields in a database. It is a good approach when the number of images to process is relatively small and the manual workload is not big enough to justify the cost of a fully automated application (yes, there are fully automated systems that can handle images with different fonts, spacing, layouts, numbers of table rows and so on).
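
To make the "click to index" idea concrete, here is a minimal Python sketch of the OCR side only: the operator's rectangle is turned into a crop box and just that zone is recognized. It assumes the Pillow and pytesseract packages and a local Tesseract install; `normalize_box` and `ocr_zone` are hypothetical helper names, and the UI that produces the drag coordinates is left out.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def normalize_box(x0: int, y0: int, x1: int, y1: int,
                  width: int, height: int) -> Box:
    """Turn two drag corners into a crop box clamped to the image bounds."""
    left, right = sorted((x0, x1))
    upper, lower = sorted((y0, y1))
    left, right = max(0, left), min(width, right)
    upper, lower = max(0, upper), min(height, lower)
    if left >= right or upper >= lower:
        raise ValueError("selection is empty or outside the image")
    return (left, upper, right, lower)


def ocr_zone(image_path: str, drag: Box) -> str:
    """OCR only the rectangle the operator selected."""
    # Imported lazily so the geometry helper works without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        box = normalize_box(*drag, *page.size)  # page.size is (width, height)
        zone = page.crop(box)
        # --psm 6 asks Tesseract to treat the crop as one uniform block of text.
        return pytesseract.image_to_string(zone, config="--psm 6").strip()
```

The returned string would then be shown in the field the operator is filling in, so it can be corrected before it is saved to the database.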

If you have decided to develop this yourself instead of buying, then all you need is to choose an OCR SDK. You are going to write all of the UI yourself, right? The big decision is open source versus commercial.

The best open source option is Tesseract OCR, as far as I know. It is free and may have real problems with table analysis, but with a manual zoning approach this should not be an issue. As for OCR accuracy, people often train the OCR on a particular font to increase accuracy, but that will not work for you, since your fonts can differ. So you can just try Tesseract and see what accuracy you get; this determines how much manual correction work there will be.
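
One simple way to "see what accuracy you get" is to type in the correct text for a handful of zones by hand and compare it with the OCR output. The sketch below uses the similarity ratio from Python's standard `difflib` module as a rough stand-in for accuracy; it is not a formal character error rate, and the function names are made up for this example.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def similarity(expected: str, recognized: str) -> float:
    """Return a 0..1 similarity score between the true text and the OCR output."""
    return SequenceMatcher(None, expected, recognized, autojunk=False).ratio()


def average_similarity(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average the score over (expected, recognized) pairs; 0.0 if there are none."""
    scores = [similarity(expected, recognized) for expected, recognized in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Running this over the same sample with different engines or settings gives a quick comparison of how much correction work each would leave for the operator.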

Commercial OCR will give higher accuracy but will cost you money. I think you should take a look anyway to see whether it is worth it or whether Tesseract is good enough for you. The simplest way would be to download a trial version of a boxed OCR product like FineReader. That will give you a good idea of what accuracy to expect from an OCR SDK.


If you always have solid borders in your table, you can try this solution:

  1. Locate the horizontal and vertical lines on each page (long runs of black pixels)
  2. Segment the image into cells using the line coordinates
  3. Clean up each cell (remove borders, threshold to black and white)
  4. Perform OCR on each cell
  5. Assemble results into a 2D array
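
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not a production implementation: the page is assumed to be already thresholded into a grid of 0/1 pixels (1 = black), the table is assumed to be unrotated, and `ocr_cell` is a hypothetical callback standing in for whatever OCR engine you use.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # 1 = black pixel, 0 = white pixel
Band = Tuple[int, int]           # first and last index of a ruling line


def longest_run(pixels: Sequence[int]) -> int:
    """Length of the longest unbroken run of black pixels."""
    best = current = 0
    for p in pixels:
        current = current + 1 if p else 0
        best = max(best, current)
    return best


def find_lines(image: Image, min_run: int, vertical: bool = False) -> List[Band]:
    """Step 1: rows (or columns) with a long black run, merged into bands."""
    lanes = list(zip(*image)) if vertical else image
    bands: List[Band] = []
    for i, lane in enumerate(lanes):
        if longest_run(lane) < min_run:
            continue
        if bands and i == bands[-1][1] + 1:
            bands[-1] = (bands[-1][0], i)  # thicken the previous line
        else:
            bands.append((i, i))
    return bands


def cell_boxes(image: Image, min_run: int) -> List[List[Tuple[int, int, int, int]]]:
    """Step 2: (top, bottom, left, right) boxes between neighbouring lines."""
    rows = find_lines(image, min_run)
    cols = find_lines(image, min_run, vertical=True)
    return [[(top[1] + 1, bottom[0], left[1] + 1, right[0])
             for left, right in zip(cols, cols[1:])]
            for top, bottom in zip(rows, rows[1:])]


def table_to_array(image: Image, min_run: int,
                   ocr_cell: Callable[[List[Sequence[int]]], str]) -> List[List[str]]:
    """Steps 3-5: crop inside the borders, OCR each cell, build a 2D array."""
    return [[ocr_cell([row[l:r] for row in image[t:b]]) for t, b, l, r in boxes]
            for boxes in cell_boxes(image, min_run)]
```

Because each box starts just after one ruling line and ends just before the next, the crop already excludes the borders. `min_run` should be longer than any stroke of text on the page, otherwise large characters may be mistaken for lines.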

Otherwise, if your document has a borderless table, you can try the following approach:

Optical Character Recognition is pretty amazing stuff, but it isn’t always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like “^” on each cell boundary – something the OCR would still recognize and that I could use later to split the resulting strings.
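
The last step of that idea, splitting the recognized text back into cells on the marker character, might look like the Python sketch below. It covers only the post-processing; drawing the "^" marks onto the image is not shown, and the function name is an assumption made for this example.

```python
from typing import List


def split_marked_text(ocr_text: str, marker: str = "^") -> List[List[str]]:
    """Split OCR output into rows of cells using the marker drawn on cell boundaries."""
    rows = []
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        rows.append([cell.strip() for cell in line.split(marker)])
    return rows
```

For example, the line `Acme Corp ^ 12.50 ^ paid` becomes `["Acme Corp", "12.50", "paid"]`. If the marker is also drawn on the outer edges of the table, each row will start and end with an empty string that you may want to drop.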

I found all this information in this link, by searching Google for "OCR to table". The author published a full algorithm using Python and Tesseract, both open source solutions!

If you want to see what Tesseract can do, you could try this site:

http://www.free-ocr.com/


I've been OCR'ing scanned documents since '98. This is a recurring problem with scanned docs, especially those that include rotated and/or skewed pages.

Yes, there are several good commercial systems, and some of them, once well configured, can provide an excellent automatic data-extraction rate, asking for the operator's help only with badly degraded fields. If I were you, I'd rely on one of them.

If the commercial options strain your budget, OSS can lend a hand. But there's no free lunch: you'll have to rely on a set of tailor-made scripts to put together an affordable solution for processing your batch of docs. Fortunately, you are not alone. Over the last few decades many people have been dealing with this. So, IMHO, the best and most concise answer to this question is provided by this article:

https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/

It is well worth reading! The author offers useful tools of his own, but the article's conclusion is what matters most, because it gives you the right mindset for solving this kind of problem.

"There is no silver bullet." (Fred Brooks, The Mythical Man-Month)


Which OCR are you talking about?
Will you be developing code based on that OCR, or will you be using something off the shelf?

FYI: Tesseract OCR

It ships with a document-reading executable, so you can feed in the whole page and it will extract the characters for you. It recognizes blank spaces pretty well, so it might be able to help with tab-spacing.
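
As a rough sketch of how that could be used, the Python below runs the `tesseract` command-line program on a whole page and splits each output line into columns wherever there is a wide gap. It assumes `tesseract` is installed and on the PATH; `preserve_interword_spaces` is a Tesseract setting that keeps runs of spaces in the output. The helper names and the two-space gap threshold are assumptions, and whether the gaps survive well enough depends on your scans.

```python
import re
import subprocess
from typing import List


def tesseract_command(image_path: str) -> List[str]:
    """Build a CLI call that prints the page text to stdout, keeping wide gaps."""
    return ["tesseract", image_path, "stdout",
            "--psm", "6",                         # one uniform block of text
            "-c", "preserve_interword_spaces=1"]  # do not collapse runs of spaces


def split_columns(line: str, min_gap: int = 2) -> List[str]:
    """Split one text line into cells on tabs or runs of at least min_gap spaces."""
    parts = re.split(r" {%d,}|\t+" % min_gap, line.strip())
    return [part for part in parts if part]


def page_to_rows(image_path: str) -> List[List[str]]:
    """OCR the whole page and return one list of cells per non-empty line."""
    result = subprocess.run(tesseract_command(image_path),
                            capture_output=True, text=True, check=True)
    return [split_columns(line) for line in result.stdout.splitlines() if line.strip()]
```

Cells that contain single spaces, such as "Blue widget", stay intact because only gaps of two or more spaces are treated as column separators.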
