Problem recognizing a table in a PNG image
Here's a way to augment @georg279's approach. We can use ImageLines to subdivide the image.
First we use a derivative filter to highlight the horizontal lines. The parameters need to be tweaked by hand.
img = Import@"http://i.stack.imgur.com/Ricz2.png"
i2 = Binarize[DerivativeFilter[img, {1, 0}, 0.2], 0.09]
Then we can get the lines and inspect them:
lines = ImageLines[i2, 0.19, 0.008];
HighlightImage[img, {Orange, Line /@ lines}]
We got every row entry, plus a block below the table, which we can discard later. We can use the coordinates in lines to subdivide the image and apply TextRecognize to the pieces:
tdata = TextRecognize[ImageResize[#, Scaled[8]]] & /@
Reverse@Rest[
ImageTake[img, -Reverse@#] & /@
Partition[Round@Sort@lines[[All, 1, 2]], 2, 1]]
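To see what the Partition step does, here is a minimal sketch on made-up numbers. The list ys is hypothetical and stands in for the y-coordinates lines[[All, 1, 2]]; the values are not taken from the actual image.

```wolfram
(* hypothetical y-coordinates of detected lines, in arbitrary order *)
ys = {12.4, 40.1, 27.6, 55.9};

(* sort, round to whole pixels, and pair each line with the next one *)
bands = Partition[Round@Sort@ys, 2, 1]
(* {{12, 28}, {28, 40}, {40, 56}} *)

(* the row specification handed to ImageTake for each band *)
-Reverse@# & /@ bands
(* {{-28, -12}, {-40, -28}, {-56, -40}} *)
```

The sign flip is needed because line coordinates are measured from the bottom of the image, while ImageTake counts rows from the top; negative row indices in ImageTake count from the bottom instead.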
We can then convert the numerals in the last ten columns to numeric data. Two things get in the way: the missing data in some columns, and the spaces in the names in the first column. By padding with "XXX", the entries in the last column were all converted, but removing the Xs afterwards took some inspection of the output.
Replace[
ToExpression[(StringSplit[tdata] /.
"X" | "XX" | "XXX" | "xxx" :> Sequence[])[[All, -11 ;;]]],
{x_Real :> x, n_Integer :> n, I | $Failed -> Missing["NotAvailable"]},
2]
(*
{{2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013},
{1.76, 1.76, 1.78, 1.78, 1.85, 1.94, 1.93, 1.97, 2.01, 2.01},
...
{2.49, 2.68, 2.79, 3.01, 3.21, 3.36, 3.56, 3.74, 4.04,
Missing["NotAvailable"], Missing["NotAvailable"]},
{2.55, 2.49, 2.51, 2.55, 2.63, 2.77, 2.82, 2.74, 2.77, 2.81, Missing["NotAvailable"]}}
*)
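As an alternative to calling ToExpression on every token, one could convert only the tokens that look like numbers. This is a sketch of that idea, not the method used above; the string row and the helper parse are hypothetical and only illustrate the approach.

```wolfram
(* hypothetical OCR output for one table row; not taken from the actual image *)
row = "Some Country XXX 1.76 1.78 I 2.01";

(* drop the padding tokens, then keep the last four fields *)
tokens = DeleteCases[StringSplit[row], "X" | "XX" | "XXX" | "xxx"][[-4 ;;]];

(* convert only strings that look like numbers; anything else becomes Missing *)
parse[s_String] :=
  If[StringMatchQ[s, NumberString], ToExpression[s], Missing["NotAvailable"]];

parse /@ tokens
(* {1.76, 1.78, Missing["NotAvailable"], 2.01} *)
```

Checking against NumberString first means stray OCR characters such as "I" never get evaluated as Wolfram Language code, so no special-case rule for I or $Failed is needed.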
TextRecognize seems to fare better if you feed it smaller regions. Here I manually isolate individual lines from the table:
i0 = Import["http://i.stack.imgur.com/Ricz2.png"];
ImageTake[i0, {60, 73}]
Column[TextRecognize[
ImageResize[ImageTake[i0, {#, # + 13}], Scaled[8]]] & /@
Range[60, 360, 16]]
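The row bands used above can be listed without touching the image. This small sketch only repeats the arithmetic of the code above (a start row of 60, a step of 16 rows, and a band 14 rows tall); the name bands is an assumption introduced for illustration.

```wolfram
(* {first row, last row} of each band passed to ImageTake *)
bands = {#, # + 13} & /@ Range[60, 360, 16];

Length[bands]
(* 19 *)

Take[bands, 3]
(* {{60, 73}, {76, 89}, {92, 105}} *)
```

The step and band height were picked by hand for this particular image, so a table with a different row pitch would need different values.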