In many business scenarios, an essential daily task is scanning, classifying and extracting key information from printed documents. At PwC Italy, auditors and lawyers dedicate a great deal of time to classifying documents before they can glean any insight from them.
This issue led PwC Italy to get involved with Pi School’s Artificial Intelligence Programme, sponsoring a full grant for engineer Roberto Calandrini. The desired solution was a model, trained using both private and public data, which could automatically assign a category to scanned documents.
Supported by the Faculty Director of the Artificial Intelligence Programme, Sébastien Bratières, and by his mentor, Riccardo Sabatini, Chief Data Scientist at Orionis Biosciences, Roberto Calandrini applied OCR and classification techniques to document scans in order to assign multi-page documents to one of several predefined classes such as invoice, work contract, vendor contract and receipt.
The problem of automatic document classification has been studied extensively over the last twenty years. But because its complexity varies with the domain in which the technology is applied, there is still no standard, robust method that works in every case.
For this particular task, the most natural representation of a document is the image itself: a matrix of pixel intensities with a single colour channel, much like the raw output of a scanner.
There are a number of other potential applications of these methods for PwC as a business. For financial auditors or data rooms, for example, automatic classification of all the documents related to a potential client would undoubtedly improve efficiency during the preliminary phase, in which the auditor has a short amount of time to assess a prospect, or alternatively during M&A analysis.
If developed further, the approach could also help to lower the operational risk of human error in interpreting documents, by applying text analysis after the image-recognition phase.
Convolutional Neural Networks are very powerful non-linear models that can easily reach tens of millions of parameters, making them hard to train and deploy in real-world scenarios. They have set new accuracy standards for classification tasks, but they cannot be put forward as the best general-purpose method, especially considering their long training times (5-6 days) and their tendency to overfit the data set.
More recent methods pair CNNs with spatial transformers to compensate for the CNN's limited invariance to affine transformations of the input image (e.g. rotation, shear, scaling), which are a common problem in document images.
The data processing pipeline that was developed combines two fast, robust feature extractors with various classifiers, with the aim of tackling the problem from different angles. It performed well (60.2% overall accuracy) while being 120x faster than a CNN across the training and prediction phases.
The RVL-CDIP data set was selected as the reference to test the image document classification methods. It has 400,000 legal documents labelled in 16 different categories, along with all the possible data quality issues found in real scenarios, including rotated, skewed, scaled and noisy documents with different aspect ratios.
The number of samples per class is almost perfectly balanced. However, the data displayed a high degree of variation in terms of orientation (documents of the same class rotated at different angles), brightness, font, layout, and signal-to-noise ratio (SNR), making it difficult to use without pre-processing.
Two pre-processing measures were adopted to overcome these issues: rescaling and histogram equalisation. Together they bring every image to a common size and aspect ratio and normalise its contrast before it is fed into the feature extractor. Robust feature extraction then keeps the representation consistent under the various affine transformations of the image and reduces the impact of intra-class variance on the classifier.
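The two pre-processing steps can be sketched with plain numpy. This is a minimal illustration, not the project's actual code: nearest-neighbour rescaling and the classic CDF-based histogram equalisation formula, applied to a single-channel 8-bit image.

```python
import numpy as np

def rescale_nn(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour rescale to a common size and aspect ratio."""
    rows = (np.arange(out_h) * img.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * img.shape[1] / out_w).astype(int)
    return img[np.ix_(rows, cols)]

def equalize_histogram(img: np.ndarray, levels: int = 256) -> np.ndarray:
    """Spread pixel intensities over the full range via the cumulative histogram."""
    hist, _ = np.histogram(img.ravel(), bins=levels, range=(0, levels))
    cdf = hist.cumsum()
    cdf_min = cdf[np.nonzero(cdf)][0]          # CDF at the first occupied bin
    if cdf[-1] == cdf_min:                     # flat image: nothing to equalise
        return img.copy()
    # Map each grey level through the normalised CDF (lookup table)
    lut = np.clip(
        np.round((cdf - cdf_min) * (levels - 1) / (cdf[-1] - cdf_min)),
        0, levels - 1,
    ).astype(np.uint8)
    return lut[img]
```

Equalisation is what makes scans of very different brightness comparable: a washed-out document and a dark one end up using the same intensity range before hashing.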
The data amounts to 49.5GB of images, so the pre-processing and feature extraction steps were executed once for all of them, with the partial results saved to disk in an HDF5 file.
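Caching features this way might look like the following sketch with `h5py`. The dataset names, sizes and schema here are hypothetical, chosen only to illustrate the pattern of writing results incrementally so the full corpus never has to fit in memory.

```python
import os
import tempfile

import numpy as np
import h5py

def cache_features(path: str, n_docs: int = 100, hash_bits: int = 64) -> None:
    """Write one feature row per document into a pre-allocated HDF5 file."""
    rng = np.random.default_rng(0)
    with h5py.File(path, "w") as f:
        feats = f.create_dataset("features", (n_docs, hash_bits), dtype="uint8")
        labels = f.create_dataset("labels", (n_docs,), dtype="uint8")
        for i in range(n_docs):
            # Stand-in for a real hash of document i
            feats[i] = rng.integers(0, 2, hash_bits)
            labels[i] = i % 16  # 16 document classes

path = os.path.join(tempfile.gettempdir(), "features_demo.h5")
cache_features(path)
with h5py.File(path, "r") as f:
    print(f["features"].shape, f["labels"].shape)
```

Because HDF5 datasets are sliceable on disk, the later classification stage can read batches of rows without reloading or re-extracting anything.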
The approach taken can be classified as supervised learning, since the research used pairs of image features and class labels to train multiple classifiers with the aim of making predictions on unseen document images. However, the feature extraction method is also adaptable to online learning, provided a proper distance metric is applied in the feature space.
The robust feature extraction methods adopted are based on the concept of perceptual hashing: the image is mapped into a feature-space representation that preserves perceptual similarity while compressing it into a very compact form.
Two variants of the algorithm were used: Average Hashing and Perceptual Hashing.
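Average hashing, the simpler of the two, can be sketched in a few lines of numpy. This is an illustrative implementation, not the project's: downscale the image to a small grid by block averaging, then emit one bit per cell depending on whether it is brighter than the mean. Similar images then differ in few bits (small Hamming distance).

```python
import numpy as np

def average_hash(img: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Return a hash_size*hash_size bit vector: 1 where a block is brighter than the mean."""
    h, w = img.shape
    # Crop so both dimensions divide evenly into hash_size blocks
    img = img[: h - h % hash_size, : w - w % hash_size]
    bh, bw = img.shape[0] // hash_size, img.shape[1] // hash_size
    # Block-mean downscaling via reshape: axis 1 and 3 run within each block
    small = img.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).astype(np.uint8).ravel()

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(h1 != h2))
```

Perceptual hashing follows the same thresholding idea but works on low-frequency DCT coefficients instead of raw block means, which makes it less sensitive to small local changes.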
The best results were obtained by combining Average Hashing with a parallelised Random Forest classifier: the 128×128 hash improved overall accuracy by 10% over the 64×64 version, with steady linear improvements as the number of samples increased.
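The classification stage could look like the following toy experiment with scikit-learn. The data here is synthetic (16 hypothetical classes, each a prototype 64-bit hash corrupted by random bit flips), not RVL-CDIP features; `n_jobs=-1` is what parallelises tree construction across cores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes, per_class, bits = 16, 60, 64

# One prototype hash per class; samples are noisy copies of their prototype
prototypes = rng.integers(0, 2, (n_classes, bits))
X = np.repeat(prototypes, per_class, axis=0)
y = np.repeat(np.arange(n_classes), per_class)
flips = rng.random(X.shape) < 0.10          # flip ~10% of bits per sample
X = np.where(flips, 1 - X, X)

# Parallelised Random Forest: n_jobs=-1 uses all available cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

idx = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = idx[:split], idx[split:]
clf.fit(X[train], y[train])
print(f"held-out accuracy: {clf.score(X[test], y[test]):.2f}")
```

Since bit-vector hashes are already short, fixed-length features, no further transformation is needed before they are fed to the forest.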
This study developed a machine learning pipeline to process big data in the document image field, extracting meaningful, robust features and feeding them into a classifier that can distinguish among 16 different document layout styles.
Recent scientific literature on the subject was analysed, as was the current state of the art: the Convolutional Neural Network approach. However, this method has various drawbacks that limit its use in real-world scenarios, for example its training time (on the order of days), training complexity (millions of parameters and hyperparameters needing tuning), and limited robustness to intra-class variance.
The method developed lays the foundations for future developments, particularly the exploration of robust neural network methods for Fast Image Processing with affine transformation corrupted data (like spatial transformers) in order to assess the performance against a more specific data set provided by PwC.
Another interesting direction would be to merge text analysis methods based on word embeddings with image analysis methods like spatial transformers to produce automatic content- and layout-based document classification: a principled approach that promises the best of both text and image analysis.