Why can't I get a string with PIL and pytesseract?
Let's start with a JPG image, because pytesseract has issues operating on GIF image formats.
import cv2
import pytesseract

filename = "/tmp/target.jpg"
image = cv2.imread(filename)
# Convert to grayscale, then binarise to strip background noise.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret, threshold = cv2.threshold(gray, 55, 255, cv2.THRESH_BINARY)
print(pytesseract.image_to_string(threshold))
Let's break down the issues here.
Your image is too noisy for the tesseract engine to identify the letters. We use simple image-processing techniques such as grayscaling and thresholding to remove some of the noise from the image.
Then, when we send it to the OCR engine, the letters are captured more accurately.
You can find the notebook where I tested this out by following this GitHub link.
Edit - I have updated the notebook with some additional image-cleaning techniques. The source image is too noisy for tesseract to work on it directly out of the box; you need to apply image-cleaning techniques first.
You can vary the thresholding parameters or swap out the Gaussian blur for some other technique until you get your desired results.
If you are looking to run OCR on noisy images, check out commercial OCR providers such as Google Cloud Vision. They provide 1000 free OCR calls per month.
First: make certain you've installed the Tesseract program (not just the python package)
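A quick, stdlib-only sanity check for this is to look the binary up on PATH (this checks the Tesseract program itself, independently of the pytesseract package):

```python
import shutil

# shutil.which returns the full path of the executable, or None if missing.
tesseract_path = shutil.which("tesseract")
if tesseract_path is None:
    print("Tesseract binary not found on PATH - install it, or point "
          "pytesseract.pytesseract.tesseract_cmd at the executable.")
else:
    print("Found Tesseract at:", tesseract_path)
```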
Jupyter Notebook of Solution: Only the image passed through remove_noise_and_smooth is successfully translated with OCR.
- When attempting to convert image.gif, TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple' is generated.
- Renaming image.gif to image.jpg generates the same TypeError.
- Opening image.gif and saving it as image.jpg produces blank output, which means the text wasn't recognized.
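A minimal sketch of doing that "save as" step with Pillow is shown below. Note that GIFs are palette ("P") mode images, which is a plausible source of the TypeError described above; converting to RGB before saving produces a genuine JPEG rather than a renamed GIF. The in-memory buffers here are assumptions standing in for image.gif and image.jpg on disk:

```python
import io
from PIL import Image

# Stand-in for image.gif: GIFs are stored in palette ("P") mode.
gif_buf = io.BytesIO()
Image.new("P", (100, 40), 255).save(gif_buf, format="GIF")
gif_buf.seek(0)

# Convert palette data to real RGB pixels, then save as JPEG.
gif = Image.open(gif_buf)
jpg_buf = io.BytesIO()
gif.convert("RGB").save(jpg_buf, format="JPEG")
jpg_buf.seek(0)

converted = Image.open(jpg_buf)
print(converted.format, converted.mode)  # JPEG RGB
```

With real files you would simply use `Image.open('image.gif').convert('RGB').save('image.jpg')`.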
from PIL import Image
import pytesseract
# If you don't have tesseract executable in your PATH, include the following:
# your path may be different than mine
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"
imgo = Image.open('0244R_clean.jpg')
print(pytesseract.image_to_string(imgo))
- No text is recognized from the original image, so it may require preprocessing to clean it up prior to OCR.
- I created a clean image, from which pytesseract extracts the text without issue. The image is included below, so you can test it with your own code to verify functionality.
Add Preprocessing
Improve Accuracy of OCR using Image Preprocessing
OpenCV
import cv2
import numpy as np
import matplotlib.pyplot as plt

def image_smoothening(img):
    # Fixed threshold, then Otsu, Gaussian blur, and Otsu again on the result.
    ret1, th1 = cv2.threshold(img, 88, 255, cv2.THRESH_BINARY)
    ret2, th2 = cv2.threshold(th1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    blur = cv2.GaussianBlur(th2, (5, 5), 0)
    ret3, th3 = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return th3

def remove_noise_and_smooth(file_name):
    img = cv2.imread(file_name, 0)  # read as grayscale
    filtered = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 9, 41)
    kernel = np.ones((1, 1), np.uint8)
    # Morphological opening then closing to remove speckle noise.
    opening = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
    closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel)
    img = image_smoothening(img)
    or_image = cv2.bitwise_or(img, closing)
    return or_image
cv2_thresh_list = [cv2.THRESH_BINARY, cv2.THRESH_TRUNC, cv2.THRESH_TOZERO]
fn = r'/tmp/target.jpg'
img1 = remove_noise_and_smooth(fn)
img2 = cv2.imread(fn, 0)
for i, img in enumerate([img1, img2]):
    img_type = {0: 'Preprocessed Images\n',
                1: '\nUnprocessed Images\n'}
    print(img_type[i])
    for item in cv2_thresh_list:
        print('Thresh: {}'.format(str(item)))
        _, thresh = cv2.threshold(img, 127, 255, item)
        plt.imshow(thresh, 'gray')
        f_name = '{}_{}.jpg'.format(i, str(item))
        plt.savefig(f_name)
        print('OCR Result: {}\n'.format(pytesseract.image_to_string(f_name)))
img1 will generate the following new images:
img2 will generate these new images: