Simple Captcha Solving
I know the post is a bit old but I suggest you to try this library I've developed some time ago. If you have a set of labelled captchas that service would fit you. Take a look: https://github.com/punkerpunker/captcha_solver
In README there is a section "Train model on external data" that you might be interested in.
I've taken a much more direct approach to filtering ink splotches from pdf documents. I won't share the whole thing it's a lot of code, but here is the general strategy I adopted:
- Use Python Pillow library to get an image object where you can manipulate pixels directly.
- Binarize the image.
- Find all connected pixels and how many pixels are in each group of connected pixels. You can do this using the minesweeper algorithm. Which is easy to search for.
- Set some threshold value of pixels that all legitimate letters are expected to have. This will be dependent on your image resolution.
- replace all black pixels in groups below the threshold with white pixels.
- Convert back to image.
There is an OCR example in python that detect the characters. Save several images and apply the filter and train a SVM algorithm. that may help you. I did trained a algorithm with even few Images but the results were acceptable. Check this link. Wish you luck
Your final output image is too blurry. To enhance the performance of pytesseract you need to sharpen it.
Sharpening is not as easy as blurring, but there exist a few code snippets / tutorials (e.g. http://datahacker.rs/004-how-to-smooth-and-sharpen-an-image-in-opencv/).
Rather than chaining blurs, blur once either using Gaussian or Median Blur, experiment with parameters to get the blur amount you need, perhaps try one method after the other but there is no reason to chain blurs of the same method.