These days, document image feature extraction and classification are in high demand in companies and organizations. The image can be a born-digital document or a scanned paper. Feature extraction is the task of extracting information from a document image, whereas classification is the process of categorizing documents based on their text content and/or their structural properties.
Working with born-digital document images is much easier than dealing with scanned documents/papers, because the former are mostly clean and neat, while scanned documents are often noisy, crooked, and skewed. In this post I will share my experience from a recent project in which I had to extract information from around 2,000,000 scanned papers. I focus on three important tasks: denoising, binarization, and aligning a skewed document image. For its simplicity and its comprehensive image processing tools, I use Python and OpenCV to work on the images.
Denoising the document image
Removing noise from a scanned paper is a necessary step before applying machine learning algorithms. Several supervised and unsupervised denoising methods exist. In this post, we use the non-local means method to eliminate noise from the image. It simply replaces the color of a pixel with an average of the colors of similar pixels; crucially, the most similar pixels to a given pixel have no reason to be close to it at all (paper link). OpenCV's fastNlMeansDenoising() function removes noise using the non-local means algorithm with some computational optimizations.
import cv2
import numpy as np

# read the noisy image as greyscale
noisyImage = cv2.imread("noisy_image.jpg", cv2.IMREAD_GRAYSCALE)

# apply the fast non-local means denoising filter
denoisedImage = cv2.fastNlMeansDenoising(noisyImage, None, h=44,
                                         templateWindowSize=7,
                                         searchWindowSize=21)

# join the noisy and denoised images side by side
noisy_denoised = np.concatenate((noisyImage, denoisedImage), axis=1)

# save the joined images to a file
cv2.imwrite("noisy_denoised.jpg", noisy_denoised)
In the fastNlMeansDenoising function we need to specify the following parameters:
templateWindowSize: The size in pixels of the template patch used to compute weights. Should be odd. The recommended value for good denoising performance is 7.
searchWindowSize: The size in pixels of the window used to compute the weighted average for a given pixel. Should be odd. A greater value leads to a longer denoising time. The recommended value is 21.
h: This parameter regulates filter strength. A big h value removes noise thoroughly but also smooths out image details, whereas a smaller h value preserves details but also leaves some noise behind.
Output of the above code on a noisy image:
Scanned Document Binarization
Binarization is a crucial task that should be done before feature extraction. It converts an image into a black-and-white image in which white pixels are represented by 255 and black pixels by 0. We do the binarization using a threshold: if a pixel value in the given image exceeds the threshold, we set it to white (255); otherwise we set it to black (0). A well-chosen threshold can also help with noise reduction, so choosing an appropriate threshold is the most important part of binarization. Otsu's method calculates a single threshold for the whole image by considering the characteristics of the entire image (it separates the pixel histogram into two classes). When we use Otsu's method, we don't need to determine the threshold explicitly, so the threshold function ignores the 0 and 255 in the arguments.
import cv2
import numpy as np

# read the image from file as greyscale
img = cv2.imread("test.jpg", cv2.IMREAD_GRAYSCALE)

# binarization with Otsu's threshold finder; the 0 and 255 arguments are ignored
threshValue, binaryImage = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)

# join the original and binarized images side by side and save them
normal_binary = np.concatenate((img, binaryImage), axis=1)
cv2.imwrite("normal_binary.jpg", normal_binary)
Output of the above code on our scanned document:
Aligning the Scanned Document
A skewed scanned document is a common issue in feature extraction and image classification tasks. To solve this problem by re-aligning the document image, we first need to find the deviation angle of the content against the horizontal axis; then we can rotate the image in the opposite direction to align the document. To find that angle, we have to extract the content's lines, which we do using the Canny edge detection function together with the HoughLinesP line detection function. Once we have the widest line of the document, we can measure the angle between it and the horizontal axis; the get_angle function does this task for us. Finally, we rotate the image content to remove the deviation, which is done by the rotate_image function in the code below:
import cv2
import numpy as np
import math

def get_angle(x1, y1, x2, y2) -> float:
    """Get the angle of this line with the horizontal axis, in degrees."""
    deltaX = x2 - x1
    deltaY = y2 - y1
    angleInDegrees = np.arctan2(deltaY, deltaX) * 180 / math.pi
    return angleInDegrees

def rotate_image(image, angle):
    """Rotate the image around its center, filling the borders with white."""
    image_center = tuple(np.array(image.shape[1::-1]) / 2)
    rot_mat = cv2.getRotationMatrix2D(image_center, angle, 1.0)
    result = cv2.warpAffine(image, rot_mat, image.shape[1::-1],
                            flags=cv2.INTER_LINEAR,
                            borderValue=(255, 255, 255))
    return result

def align_image(img):
    # median blurring to get rid of the noise
    # img = cv2.medianBlur(img, 3)  # use this if the document image is noisy
    edges = cv2.Canny(img, 80, 120)

    # detect lines
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 10,
                            minLineLength=20, maxLineGap=10)
    if lines is None:
        return img

    # sort lines from widest to narrowest by horizontal extent
    lines = sorted(lines, key=lambda l: abs(l[0][2] - l[0][0]), reverse=True)

    # consider only the widest line: if it spans more than 25% of the image
    # width and deviates from the horizontal by more than 1 degree, rotate
    for x1, y1, x2, y2 in lines[0]:
        if abs(x2 - x1) / edges.shape[1] > 0.25:
            angle = get_angle(x1, y1, x2, y2)
            if abs(angle) > 1.0:
                img = rotate_image(img, angle)
                print("rotated")
    return img

# read the unaligned image
unalignedImage = cv2.imread("unaligned_image.jpg", cv2.IMREAD_GRAYSCALE)
# apply the re-aligning function
aligned_image = align_image(unalignedImage)
# join the unaligned and aligned images side by side
unaligned_aligned = np.concatenate((unalignedImage, aligned_image), axis=1)
# save the joined images to a file
cv2.imwrite("unaligned_aligned.jpg", unaligned_aligned)
Output of the align_image function :
You can find the code snippet working on a sample noisy document in the Github repository.
If you have any questions about this post, please don't hesitate to leave a comment here.