Why can't I get a string with PIL and pytesseract?
Let's start with a JPG image, because pytesseract has issues operating on GIF image formats.
import cv2
import pytesseract

filename = "/tmp/target.jpg"
image = cv2.imread(filename)
# Convert to grayscale, then binarise to strip background noise.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret, threshold = cv2.threshold(gray, 55, 255, cv2.THRESH_BINARY)
print(pytesseract.image_to_string(threshold))
Let's break down the issues here.
Your image is too noisy for the tesseract engine to identify the letters. We use simple image-processing techniques such as grayscaling and thresholding to remove some of the noise from the image.
Then, when we send it to the OCR engine, the letters are captured more accurately.
You can find the notebook where I tested this out by following this GitHub link.
Edit - I have updated the notebook with some additional image-cleaning techniques. The source image is too noisy for tesseract to work on it directly out of the box; you need to apply image-cleaning techniques first.
You can vary the thresholding parameters or swap out the Gaussian blur for some other technique until you get your desired results.
If you are looking to run OCR on noisy images, check out commercial OCR providers such as Google Cloud Vision. They provide 1000 free OCR calls per month.
First: make certain you've installed the Tesseract program (not just the python package)
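A quick, stdlib-only sanity check for this is to look the binary up on PATH (this checks the Tesseract program itself, independently of the pytesseract package):

```python
import shutil

# shutil.which returns the full path of the executable, or None if missing.
tesseract_path = shutil.which("tesseract")
if tesseract_path is None:
    print("Tesseract binary not found on PATH - install it, or point "
          "pytesseract.pytesseract.tesseract_cmd at the executable.")
else:
    print("Found Tesseract at:", tesseract_path)
```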
Jupyter Notebook of Solution: Only the image passed through remove_noise_and_smooth is successfully translated with OCR.
- When attempting to convert image.gif, TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple' is generated.
- Renaming image.gif to image.jpg generates the same TypeError.
- Opening image.gif and saving it as image.jpg produces blank output, which means the text wasn't recognized.
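A minimal sketch of doing that "save as" step with Pillow is shown below. Note that GIFs are palette ("P") mode images, which is a plausible source of the TypeError described above; converting to RGB before saving produces a genuine JPEG rather than a renamed GIF. The in-memory buffers here are assumptions standing in for image.gif and image.jpg on disk:

```python
import io
from PIL import Image

# Stand-in for image.gif: GIFs are stored in palette ("P") mode.
gif_buf = io.BytesIO()
Image.new("P", (100, 40), 255).save(gif_buf, format="GIF")
gif_buf.seek(0)

# Convert palette data to real RGB pixels, then save as JPEG.
gif = Image.open(gif_buf)
jpg_buf = io.BytesIO()
gif.convert("RGB").save(jpg_buf, format="JPEG")
jpg_buf.seek(0)

converted = Image.open(jpg_buf)
print(converted.format, converted.mode)  # JPEG RGB
```

With real files you would simply use `Image.open('image.gif').convert('RGB').save('image.jpg')`.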
from PIL import Image
import pytesseract
# If you don't have tesseract executable in your PATH, include the following:
# your path may be different than mine
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"
imgo = Image.open('0244R_clean.jpg')
print(pytesseract.image_to_string(imgo))
- No text is recognized from the original image, so it may require preprocessing to clean it up prior to OCR.
- I created a clean image, from which pytesseract extracts the text without issue. The image is included below, so you can test it with your own code to verify functionality.
Add Preprocessing
Improve Accuracy of OCR using Image Preprocessing
OpenCV
import cv2
import numpy as np
import matplotlib.pyplot as plt

def image_smoothening(img):
    # Fixed threshold, then Otsu, Gaussian blur, and Otsu again on the result.
    ret1, th1 = cv2.threshold(img, 88, 255, cv2.THRESH_BINARY)
    ret2, th2 = cv2.threshold(th1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    blur = cv2.GaussianBlur(th2, (5, 5), 0)
    ret3, th3 = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return th3

def remove_noise_and_smooth(file_name):
    img = cv2.imread(file_name, 0)  # read as grayscale
    filtered = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 9, 41)
    kernel = np.ones((1, 1), np.uint8)
    # Morphological opening then closing to remove speckle noise.
    opening = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
    closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel)
    img = image_smoothening(img)
    or_image = cv2.bitwise_or(img, closing)
    return or_image
cv2_thresh_list = [cv2.THRESH_BINARY, cv2.THRESH_TRUNC, cv2.THRESH_TOZERO]
fn = r'/tmp/target.jpg'
img1 = remove_noise_and_smooth(fn)
img2 = cv2.imread(fn, 0)
for i, img in enumerate([img1, img2]):
    img_type = {0: 'Preprocessed Images\n',
                1: '\nUnprocessed Images\n'}
    print(img_type[i])
    for item in cv2_thresh_list:
        print('Thresh: {}'.format(str(item)))
        _, thresh = cv2.threshold(img, 127, 255, item)
        plt.imshow(thresh, 'gray')
        f_name = '{}_{}.jpg'.format(i, str(item))
        plt.savefig(f_name)
        print('OCR Result: {}\n'.format(pytesseract.image_to_string(f_name)))
img1 will generate the following new images:
img2 will generate these new images: