Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?
As you've seen, it isn't there.
So you can either modify the Tesseract source code to emit hOCR that includes the x_confs property you want, or use the ResultIterator API class to get the confidence at the character (symbol) level (be sure to call SetVariable("save_blob_choices", "T") after the Init method).
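For the ResultIterator route, here is a minimal sketch in Python. It assumes the third-party tesserocr binding (not mentioned above) instead of the C++ API, and the function name symbol_confidences is made up for illustration, so treat it as a starting point only.

```python
def symbol_confidences(image_path, lang="eng"):
    """Return a list of (character, confidence, bounding box) tuples.

    Assumption: the third-party tesserocr binding is installed. It is
    imported inside the function so this file still loads without it.
    """
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    results = []
    # Constructing PyTessBaseAPI performs the Init step.
    with PyTessBaseAPI(lang=lang) as api:
        # Set the variable after Init, as the answer above advises.
        api.SetVariable("save_blob_choices", "T")
        api.SetImageFile(image_path)
        api.Recognize()

        iterator = api.GetIterator()
        level = RIL.SYMBOL  # one step per character (symbol)
        for item in iterate_level(iterator, level):
            results.append((
                item.GetUTF8Text(level),   # the recognized character
                item.Confidence(level),    # confidence for this symbol
                item.BoundingBox(level),   # (left, top, right, bottom)
            ))
    return results
```

You would call it as symbol_confidences("page.png"), where page.png is a placeholder for your own image.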
This now seems to be available in Tesseract 4.x.
See my answer at:
https://stackoverflow.com/a/57766860/1021819
Set hocr_char_boxes to 1 in your config file. Or, at the command line, your updated command would be:
tesseract [Image name] outputbase --oem 1 -l eng --psm 8 -c hocr_char_boxes=1 hocr
Note the hocr output option, and look in that file for ..._wconf, e.g.
Let me know if this works for you, otherwise I'll just delete the answer.
Source: https://github.com/tesseract-ocr/tesseract/issues/1465#issuecomment-513139976
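Once you have the hOCR file, something like the following could pull the per-character boxes and confidences out of it. This is only a sketch: it assumes that with hocr_char_boxes=1 each character is written as a span with class ocrx_cinfo whose title holds x_bboxes and x_conf, which is what I see in Tesseract 4.1 output but may differ in your version. The class name CharBoxParser, the function parse_char_boxes, and the SAMPLE markup are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed title layout: "x_bboxes L T R B; x_conf C"
BOX_RE = re.compile(r"x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
CONF_RE = re.compile(r"x_conf\s+([\d.]+)")


class CharBoxParser(HTMLParser):
    """Collect text, bounding box and confidence of each ocrx_cinfo span."""

    def __init__(self):
        super().__init__()
        self.chars = []
        self._pending = None  # the character span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_cinfo":
            title = attrs.get("title") or ""
            box = BOX_RE.search(title)
            conf = CONF_RE.search(title)
            if box and conf:
                self._pending = {
                    "text": "",
                    "bbox": tuple(int(v) for v in box.groups()),
                    "conf": float(conf.group(1)),
                }

    def handle_data(self, data):
        if self._pending is not None:
            self._pending["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._pending is not None:
            self.chars.append(self._pending)
            self._pending = None


def parse_char_boxes(hocr_text):
    """Return one dict per character found in the hOCR markup."""
    parser = CharBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.chars


# Hand-written sample in the assumed format, not real Tesseract output.
SAMPLE = """<span class='ocrx_word' id='word_1_1' title='bbox 10 20 50 40; x_wconf 93'>
 <span class='ocrx_cinfo' title='x_bboxes 10 20 28 40; x_conf 98.5'>H</span>
 <span class='ocrx_cinfo' title='x_bboxes 30 20 50 40; x_conf 91.25'>i</span>
</span>"""

if __name__ == "__main__":
    for ch in parse_char_boxes(SAMPLE):
        print(ch["text"], ch["bbox"], ch["conf"])
```

To use it on real output, read outputbase.hocr (the file produced by the command above) and pass its contents to parse_char_boxes.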