December 9, 2024

Scanned Document Preprocessing For Classification and Feature Extraction

Feature extraction from document images and document classification are in high demand in companies and organizations. The image can be a digital document or a scanned paper. Feature extraction is the task of extracting information from a document image, whereas classification is the process of assigning documents to categories based on their text content and/or their structural properties.

Working with digital document images is much easier than dealing with scanned documents, because the former are mostly clean and tidy, while scanned documents are often noisy, crooked and skewed. In this post I share my experience from a recent project in which I had to extract information from around 2,000,000 scanned papers. I focus on three important tasks: denoising, binarization and aligning the skewed document image. I use Python and OpenCV to work with the images.

Denoising the Document Image

Removing noise from a scanned paper is a necessary step before applying machine learning algorithms. Several supervised and unsupervised denoising methods exist. In this post we use the non-local means method. It replaces the value of a pixel with a weighted average of the values of pixels whose neighborhoods look similar, and the pixels most similar to a given pixel need not be close to it at all (paper link). OpenCV's fastNlMeansDenoising() function removes noise using the non-local means algorithm with several computational optimizations.

import cv2
import numpy as np

# read the noisy image
noisyImage = cv2.imread("noisy_image.jpg", cv2.IMREAD_GRAYSCALE)
# apply the fast non-local means denoising filter
denoisedImage = cv2.fastNlMeansDenoising(noisyImage, None, h=44, templateWindowSize=7, searchWindowSize=21)
# join noisy and denoised images side by side
noisy_denoised = np.concatenate((noisyImage, denoisedImage), axis=1)
# save the joined images to a file
cv2.imwrite("noisy_denoised.jpg", noisy_denoised)

The fastNlMeansDenoising function takes the following parameters:

templateWindowSize: size in pixels of the template patch used to compute the weights. Should be odd. The recommended value is 7.

searchWindowSize: size in pixels of the window used to compute the weighted average for a given pixel. Should be odd. A larger value leads to a longer denoising time. The recommended value is 21.

h: regulates the filter strength. A large h removes noise more aggressively but also removes image details, whereas a smaller h preserves details but also preserves some noise.

Output of the code above on a noisy image:

Scanned document denoising
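
To make the idea behind non-local means more concrete, here is a minimal and deliberately slow NumPy sketch of the weighting scheme. It is not OpenCV's implementation. The function name nl_means_naive, the small window sizes and the synthetic two-tone test image are assumptions made for illustration, and the weights use the simplified form exp(-d^2 / h^2), where d^2 is the mean squared difference between two patches.

```python
import numpy as np

def nl_means_naive(img, h=20.0, template=3, search=7):
    """Naive non-local means on a small 2-D image (illustration only).

    template and search are odd window sizes in pixels, playing the same
    role as templateWindowSize and searchWindowSize; h controls how fast
    the weight drops as two patches become less similar.
    """
    t, s = template // 2, search // 2
    # pad so that every patch inside every search window is fully defined
    padded = np.pad(img.astype(np.float64), t + s, mode="reflect")
    out = np.zeros(img.shape, dtype=np.float64)
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            pr, pc = r + t + s, c + t + s  # same pixel in padded coordinates
            ref = padded[pr - t:pr + t + 1, pc - t:pc + t + 1]
            weight_sum = 0.0
            value_sum = 0.0
            for dr in range(-s, s + 1):
                for dc in range(-s, s + 1):
                    qr, qc = pr + dr, pc + dc
                    patch = padded[qr - t:qr + t + 1, qc - t:qc + t + 1]
                    dist2 = np.mean((ref - patch) ** 2)
                    w = np.exp(-dist2 / (h * h))  # similar patch -> weight near 1
                    weight_sum += w
                    value_sum += w * padded[qr, qc]
            out[r, c] = value_sum / weight_sum
    return out

# Synthetic test image (assumption): dark left half, bright right half, Gaussian noise
rng = np.random.default_rng(0)
clean = np.full((16, 16), 50.0)
clean[:, 8:] = 200.0
noisy = clean + rng.normal(0.0, 15.0, clean.shape)

denoised = nl_means_naive(noisy, h=20.0)
mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE noisy: {mse_noisy:.1f}, MSE denoised: {mse_denoised:.1f}")
```

Pixels on the other side of the edge get a weight close to zero because their patches look very different, which is why the edge survives the averaging. On real scans you would use cv2.fastNlMeansDenoising, which is far faster than this quadruple loop.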

Scanned Document Binarization

Binarization is a crucial step before feature extraction. It converts an image into a black-and-white image in which white pixels have the value 255 and black pixels the value 0. We binarize using a threshold: if a pixel value exceeds the threshold, we set it to white (255); otherwise we set it to black (0). A well-chosen threshold also helps with noise reduction, so choosing an appropriate threshold is the most important part of binarization. Otsu's method computes a single threshold for the whole image from its gray-level histogram, picking the value that best separates the pixels into two classes. When we use Otsu's method we don't need to specify the threshold explicitly: the threshold argument passed to the function (0 in the code) is ignored, while 255 is still used as the value assigned to pixels above the computed threshold.

import cv2
import numpy as np

#read image from file
img = cv2.imread("test.jpg",cv2.IMREAD_GRAYSCALE)

# binarization with Otsu's threshold finder: the thresh argument (0) is ignored, 255 is the value given to white pixels
threshValue, binaryImage = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)

normal_binary = np.concatenate((img, binaryImage), axis=1)

cv2.imwrite("normal_binary.jpg",normal_binary)

Output of the code above on our scanned document:

Scanned document binarization using Otsu's method
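
To make the threshold selection less of a black box, the following NumPy sketch computes Otsu's threshold directly from the histogram by maximizing the between-class variance. It is meant as an illustration of the principle, not as a replacement for cv2.threshold. The helper names otsu_threshold and binarize and the synthetic "page" are assumptions, and the result may differ slightly from OpenCV's on real images because of tie-breaking and rounding.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance (uint8 input)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)               # share of pixels with value <= t
    mu_cum = np.cumsum(prob * levels)  # cumulative mean up to t
    mu_total = mu_cum[-1]
    w1 = 1.0 - w0                      # share of pixels with value > t
    valid = (w0 > 0) & (w1 > 0)        # both classes must be non-empty
    between = np.zeros(256)
    between[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / (w0[valid] * w1[valid])
    return int(np.argmax(between))

def binarize(gray, threshold, max_value=255):
    """Pixels above the threshold become max_value, all others become 0."""
    return np.where(gray > threshold, max_value, 0).astype(np.uint8)

# Synthetic page (assumption): bright paper with one dark stripe of "text"
rng = np.random.default_rng(1)
page = rng.integers(190, 230, size=(40, 60), dtype=np.uint8)
page[10:14, 5:55] = rng.integers(30, 60, size=(4, 50), dtype=np.uint8)

t = otsu_threshold(page)
binary = binarize(page, t)
print("threshold:", t, "unique output values:", np.unique(binary))
```

The comparison gray > threshold mirrors the rule described above: values that exceed the threshold become white, everything else becomes black.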

Aligning the Scanned Document

A skewed scan is a common problem in both feature extraction and image classification. To fix it by re-aligning the document image, we first need to find the angle by which the content deviates from the horizontal. Then we rotate the image to cancel that deviation. To find the angle, we extract the lines of the content using the Canny edge detector together with the HoughLinesP line detector. Once we have the widest line in the document, we compute the angle between it and the horizontal; the get_angle function does this. Finally, we rotate the image content to remove the deviation, which is done by the rotate_image function in the code below:


import cv2
import numpy as np
import math

def get_angle(x1, y1, x2, y2) -> float:
    """Get the angle of this line with the horizontal axis."""
    deltaX = x2 - x1
    deltaY = y2 - y1
    angleInDegrees = np.arctan2(deltaY , deltaX) * 180 / math.pi
    
    return angleInDegrees

def rotate_image(image, angle):
    image_center = tuple(np.array(image.shape[1::-1]) / 2)
    rot_mat = cv2.getRotationMatrix2D(image_center, angle, 1.0)
    result = cv2.warpAffine(image, rot_mat, image.shape[1::-1], flags=cv2.INTER_LINEAR, borderValue=(255,255,255) )
    return result


def align_image(img):

    # Optional: median blur to suppress noise before edge detection
    #img =  cv2.medianBlur(img, 3) # use this if the document image is noisy

    edges = cv2.Canny(img, 80, 120)

    # Detect line segments with the probabilistic Hough transform
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, 10, minLineLength=20, maxLineGap=10)
    # HoughLinesP returns None when no segment is found
    if lines is None:
        return img
    # sort segments by horizontal extent, widest first
    lines = sorted(lines, key=(lambda l: abs(l[0][0]-l[0][2])), reverse=True)

    # use only the widest segment: if it spans more than 25% of the image width
    # and deviates from the horizontal by more than 1 degree, rotate the image
    x1, y1, x2, y2 = lines[0][0]
    if x2 < x1:
        # order the endpoints left to right so the angle stays within (-90, 90)
        x1, y1, x2, y2 = x2, y2, x1, y1
    if (abs(x2-x1) / edges.shape[1]) > 0.25:
        angle = get_angle(x1, y1, x2, y2)
        if abs(angle) > 1.0:
            img = rotate_image(img, angle)
            print("rotated")

    return img

#read the unaligned image
unalignedImage= cv2.imread("unaligned_image.jpg",cv2.IMREAD_GRAYSCALE)

#apply re-aligning function
aligned_image = align_image(unalignedImage)

unaligned_aligned = np.concatenate((unalignedImage, aligned_image), axis=1)
#save joined images in file
cv2.imwrite("unaligned_aligned.jpg",unaligned_aligned)

Output of the align_image function:

Left: before re-aligning the document. Right: after re-alignment.
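
It is easy to get the sign of the rotation wrong, so here is a small NumPy-only check that rotating by the measured angle really levels the line. The sketch rebuilds the 2x3 matrix that the OpenCV documentation gives for cv2.getRotationMatrix2D, where a positive angle means counter-clockwise rotation with the origin at the top-left corner. The helper names line_angle_degrees, rotation_matrix_2d and transform_point, the sample endpoints and the image center are assumptions for illustration.

```python
import math
import numpy as np

def line_angle_degrees(x1, y1, x2, y2):
    """Angle of a segment against the horizontal, in image coordinates (y points down)."""
    if x2 < x1:
        # order the endpoints left to right so the result stays within (-90, 90)
        x1, y1, x2, y2 = x2, y2, x1, y1
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

def rotation_matrix_2d(center, angle_degrees, scale=1.0):
    """NumPy version of the matrix documented for cv2.getRotationMatrix2D."""
    cx, cy = center
    a = scale * math.cos(math.radians(angle_degrees))
    b = scale * math.sin(math.radians(angle_degrees))
    return np.array([[a, b, (1 - a) * cx - b * cy],
                     [-b, a, b * cx + (1 - a) * cy]])

def transform_point(matrix, x, y):
    """Map a source point to its position in the rotated image."""
    return matrix @ np.array([x, y, 1.0])

# A text line that drops 35 pixels over 400 pixels, so it slopes down to the right
x1, y1, x2, y2 = 100, 200, 500, 235
angle = line_angle_degrees(x1, y1, x2, y2)

# Rotate around the center of a hypothetical 640x480 image
m = rotation_matrix_2d((320, 240), angle)
p1 = transform_point(m, x1, y1)
p2 = transform_point(m, x2, y2)
print(f"angle: {angle:.2f} degrees")
print(f"y before: {y1}, {y2}  y after: {p1[1]:.2f}, {p2[1]:.2f}")
```

After the transform both endpoints share the same y coordinate, which is the reason the measured angle can be passed to the rotation unchanged instead of being negated.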

You can find the code snippets, applied to a sample noisy document, in the GitHub repository.

If you have any questions about this post, please don't hesitate to leave a comment.
