I have been working on ML projects that require image preprocessing and text extraction. To improve the quality of text extraction, there are many preprocessing steps that we need to do, they are elicited below. We use OpenCV for doing the preprocessing and tesseract-ocr for text extraction.
Image preprocessing
- Rescaling: Rescales the smaller images by using resize function.(conditional application)
- Gray Scale Conversion: Converts the image from 4 channels (RGB+alpha) to gray scale.
- Morphology-Ex: We use the morph_open transformation which is erosion followed by dilation.
- Gaussian Filtering: Apply gaussian filter to smooth out text.
- Binarization: Converts the gray scale image to binary image by applying thresholding.
Skew and/or Rotation Detection
In our case, we primarily find that scans are rotated to every possible orientation, but they are not skewed much. Thus, detecting correct orientation, improves the quality of the extracted text.
We use a tweaked hybrid approach: Hough Lines for angle detection and comparing quality of the extracted text (against a dictionary).
We need the hybrid approach as even with proper skew/rotation detection, you cannot tell whether the text is upside down. We tweak the Hough Lines transform, so that it majorly detects the text as lines and not the borders, which may or may not be present in the image.
To accomplish this, we tweaked the accumulators' rho parameter to bucket a lot of lines together. This causes fewer borders to be included, while having little effect on text lines. This is illustrated in the below images.
Looking at the function definition of HoughLinesP.
cv2.HoughLinesP(image, rho, theta, threshold[, lines[, minLineLength[, maxLineGap]]])
- Each of these parameters is significant:
- theta: Provides the angle resolution. same as the theta in polar coordinates. As with rho, higher values lead to bucketing together.
- rho: Provides distance resolution. Higher values lead to bucketing together. The combination of rho and theta produces a rho*theta matrix that serves as the accumulator.
- threshold: This parameter ensures that the accumulator does not consider a point, if the number of lines crossing the point is not greater than threshold.
- minLineLength: Only lines greater than the minimum length are returned. This can be decreased to allow for short phrases of text.
- maxLineGap: The maximum allowed space between collinear points, before not considering them part of the same line. This should be kept lower, ideally less than the line gap between lines.
Also note that hough lines requires some preprocessing of image with an edge detection algorithm. And, we use Canny Edge algorithm.
The HoughLinesP
function returns the coordinates of lines and we simply apply arctan to get the angle of lines and pick the mode, this give us the detected angle.
The final step then is to run tesseract on rotated image and its inverse. We use langdetect python library for detecting language of text. And pass the word tokens through the language dictionary. Finally, we pick the orientation which detects the most words and save the rotated image and the extracted text.
Tuning or Tweaks
We run the entire preprocessing stage using javacpp which has wrappers for OpenCV, this leads to higher performance(We use parallel streams for processing individual images in parallel), and leave the orientation detection and extraction to python (we employ python process pool and map for parallelization).