The Vatican Secret Archive is a private repository belonging to the Pope which is dedicated to preserving all the acts promulgated by the Holy See. It is one of the largest and oldest historical archives in the world and is home to some of the most ancient and valuable documents on Earth, including Galileo Galilei’s trial paperwork and Henry VIII’s marriage annulment request.
Despite many years of efforts to digitise its collections, it is still very difficult for historians and scholars in the humanities to apply modern, automated document analysis methods and full-text searches to the archive, since digitised document images require transcription first.
Transcribing historical documents is an arduous and expensive task that requires expert knowledge: only an automated solution could make it viable on a large scale. The In Codice Ratio research project aims to develop methods and tools to automate transcription and extract information in challenging contexts, such as from medieval manuscripts.
Engineer Elena Nieddu, supported by her mentor Lukasz Kaiser and by Sébastien Bratières, the Faculty Director of the AI School, further developed her academic thesis on this subject at Pi School. As a case study, the project focuses on the Vatican Registers, a collection containing the official transcripts of all papal correspondence from the 10th to the 16th century. The main focus is on the 13th-century Registers, as they represent a large part of the collection and share a similar handwriting style.
Automatic transcription of medieval handwritten documents is not a task that can be easily solved by optical character recognition: although medieval handwriting is fairly regular compared to its many modern variants, it is still very hard to isolate individual characters in each word, a necessary step in the classic OCR approach.
Moreover, many glyphs are written differently or employ different connectors depending on their position and the adjacent characters.
These challenges, as well as the state-of-the-art handwritten text recognition technology available today, indicate that a machine learning approach is preferable to an unfeasibly complex image analysis.
However, such models require supervision, and the data set needs to be very large to achieve a high level of accuracy.
Data labelling is an expensive and time-consuming process, especially in the case of medieval handwriting, where only a few experts have the knowledge required to provide correct transcriptions; they also need a fair amount of time to do so.
The goal of the Pi School project was to experiment with new approaches. In particular, the project aimed to test the viability of an end-to-end, sequence-to-sequence transcription system.
Current pipeline vs end-to-end approach
To keep the effort of assembling a data set small and easy to scale, the project's current approach is to collect individual character labels (a very simple “image matching” task, as opposed to expert transcription work) by providing researchers with positive examples of character images.
The labelled data collected was used to train a custom convolutional character classifier, achieving 96% accuracy on a held-out test set.
In short, the current approach enables easy data set collection at the cost of error-prone, highly pipelined character-by-character recognition.
Synthetic data set generation
The chosen approach leverages existing data to generate a synthetic data set of line images paired with their transcriptions, combining character images into words and then lines and using the Latin text corpus as a reference for the sentences.
To generate sufficiently realistic lines of text, previously omitted symbols could no longer be ignored: 45 new glyphs, including uppercase characters, punctuation and simple, frequent abbreviations, were added to the 22 lowercase Latin characters by manually collecting 10-20 examples per glyph. This still leaves out many common abbreviations; these are hard to reproduce because their resolution does not map directly to a fixed sequence of characters, but depends on the word in question.
Generation process: In the script under consideration, the same word can be written either in full or abbreviated. The generation phase has to account for this when it combines character images into a word image. To do so, a dictionary maps regular expressions to possible forms: each entry relates a commonly abbreviated substring to the ways it can be written, each expressed as a sequence of known image symbols.
For each word in the corpus, matches for the regular expressions are computed: this generates all the alternative scripts for the word. These alternatives are then represented as a directed acyclic graph, and the symbol sequence for the word is computed by choosing a random path from the graph’s source to its sink, i.e. from the first to the last character in the word.
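The matching and path-selection steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the `ABBREVIATIONS` dictionary, its patterns and symbol names such as `<con>` are hypothetical, and the graph is assumed to have one node per character position, with one edge per plain character and one extra edge per regular-expression match.

```python
import random
import re

# Hypothetical abbreviation dictionary: regular expression -> alternative
# forms, each form being a sequence of known image symbols.
ABBREVIATIONS = {
    r"^con": [["<con>"]],
    r"que$": [["<que>"]],
}


def build_graph(word, abbreviations):
    """Return a DAG over positions 0..len(word) as {position: [(next, symbols)]}."""
    # Writing each character in full is always possible.
    edges = {i: [(i + 1, [ch])] for i, ch in enumerate(word)}
    for pattern, forms in abbreviations.items():
        for match in re.finditer(pattern, word):
            if match.end() > match.start():  # skip empty matches
                for form in forms:
                    edges[match.start()].append((match.end(), list(form)))
    return edges


def random_script(word, abbreviations, rng=random):
    """Pick a random source-to-sink path and return its symbol sequence."""
    edges = build_graph(word, abbreviations)
    position, symbols = 0, []
    while position < len(word):
        position, step = rng.choice(edges[position])
        symbols.extend(step)
    return symbols


def all_scripts(word, abbreviations):
    """Enumerate every alternative script for the word (for inspection)."""
    edges = build_graph(word, abbreviations)

    def walk(position):
        if position == len(word):
            yield []
            return
        for nxt, step in edges[position]:
            for rest in walk(nxt):
                yield step + rest

    return list(walk(0))
```

Because every edge moves forward in the word, the graph is acyclic and any walk from position 0 reaches the end. Note that choosing uniformly among the outgoing edges at each node is an assumption of this sketch; it does not sample whole paths uniformly.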
Following this step, the symbol images are composed into words and the words into lines in a rather straightforward way, by aligning them in the middle of a new, blank line of similar size to the original manuscript images. The corresponding transcription is generated at the same time.
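The composition step can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: images are plain lists of grayscale rows, glyphs are vertically centred on the blank line, the line width is derived from its content, and the gap sizes and the names `compose_line` and `paste_centred` are hypothetical.

```python
BLANK = 255  # assumed background intensity (white)


def paste_centred(canvas, glyph, left):
    """Copy a glyph onto the canvas, vertically centred, from column `left`."""
    if len(glyph) > len(canvas):
        raise ValueError("glyph is taller than the line")
    top = (len(canvas) - len(glyph)) // 2
    for r, row in enumerate(glyph):
        canvas[top + r][left:left + len(row)] = row


def compose_line(words, glyph_images, line_height, symbol_gap=1, word_gap=4):
    """Build a line image and its transcription.

    words: list of (transcription, symbols) pairs, one per word.
    glyph_images: dict mapping each symbol to a rectangular list of rows.
    """
    placements, cursor = [], 0
    for w_index, (_, symbols) in enumerate(words):
        if w_index > 0:
            cursor += word_gap  # wider gap between words
        for s_index, symbol in enumerate(symbols):
            if s_index > 0:
                cursor += symbol_gap  # narrow gap between symbols
            glyph = glyph_images[symbol]
            placements.append((glyph, cursor))
            cursor += len(glyph[0])
    # Start from a blank line, then paste every glyph at its column.
    canvas = [[BLANK] * cursor for _ in range(line_height)]
    for glyph, left in placements:
        paste_centred(canvas, glyph, left)
    transcription = " ".join(text for text, _ in words)
    return canvas, transcription
```

Returning the image and the transcription from the same function mirrors the point made above: the label is produced at the same time as the image, so the pair is consistent by construction.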
Using this method, over 120k sets of line images and transcriptions were generated as a synthetic data set for sequence-to-sequence prediction models.
The model chosen for sequence-to-sequence prediction was the Transformer, created by Google Brain and introduced in the 2017 paper “Attention Is All You Need”, as implemented in the Tensor2Tensor library.
Tensor2Tensor is a TensorFlow-based library for supervised learning with support for sequence tasks. It is actively used and maintained by researchers and engineers from the Google Brain team, including the mentor for this project, Lukasz Kaiser.
It includes common data sets and implementations for many recent deep learning architectures made by Google, including SliceNet, MultiModel, ByteNet and the Neural GPU.
Models, problems and data sets in Tensor2Tensor are modular and easy to extend, as the goal of the library is to help the deep learning community to replicate and experiment with state-of-the-art technology.
Transformer: A novel and promising sequence-to-sequence model for machine translation which has already improved on the state of the art in translation from English into German and French.
While most recent approaches to sequence transduction involve complex recurrent or convolutional neural networks in an encoder-decoder configuration, Transformer is based entirely on attention mechanisms, thus gaining greatly in terms of efficiency: its authors report obtaining their results at a fraction of the training cost of the best models from the literature (3 to 50 times less).
The model is also able to generalise well in order to perform other tasks: with little adaptation, the same network outperformed all but one of the previously proposed approaches to constituency parsing.
Efforts are now focused on testing Transformer’s generalisation capabilities with other inputs and domains, such as images and video, in tasks such as captioning and classification.
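To make "based entirely on attention" more concrete, the sketch below implements scaled dot-product attention, the basic operation defined in the Transformer paper, in plain Python. It is deliberately simplified: a single head, no masking and no learned projections. It is not the Tensor2Tensor implementation, and the function names are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Every output position is computed independently of the others, with no recurrence over the sequence, which is what allows the computation to be parallelised across positions.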
Training: The model was trained for 500k steps on a synthetic data set of more than 120k line images with transcriptions. All the training hyperparameters used were the default values for the Transformer model.
Metrics: The metrics considered when evaluating the model were logarithmic perplexity, accuracy and top-5 accuracy, all computed per symbol. With no tuning, the model achieved 50% accuracy, 87% top-5 accuracy and a log perplexity of 1.6 on a held-out evaluation set.
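As a rough illustration of how these per-symbol metrics relate to the model's output, the sketch below computes them from predicted probability distributions. It assumes that log perplexity means the mean negative natural-log probability assigned to the correct symbol; the exact definition and logarithm base used in the experiments are not stated here, and `per_symbol_metrics` is a hypothetical helper, not a Tensor2Tensor function.

```python
import math


def per_symbol_metrics(predictions, targets, k=5):
    """Compute accuracy, top-k accuracy and log perplexity per symbol.

    predictions: one dict {symbol: probability} per output position.
    targets: the correct symbol for each position.
    """
    correct = in_top_k = 0
    neg_log_likelihood = 0.0
    for probs, target in zip(predictions, targets):
        ranked = sorted(probs, key=probs.get, reverse=True)
        correct += ranked[0] == target
        in_top_k += target in ranked[:k]
        # Clamp to avoid log(0) when the target received no probability.
        neg_log_likelihood -= math.log(max(probs.get(target, 0.0), 1e-12))
    n = len(targets)
    return {
        "accuracy": correct / n,
        "top_k_accuracy": in_top_k / n,
        "log_perplexity": neg_log_likelihood / n,
    }
```

The gap between accuracy and top-5 accuracy reported above means the correct symbol is often among the model's leading candidates even when it is not ranked first.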
Future developments: Synthetic data generation can allow us to apply deep learning to tasks which have no large data set available. Using this approach, the In Codice Ratio project can benefit from a sequence-to-sequence prediction model while minimising the effort required for data set collection.
The work carried out so far can act as a baseline for future work, as there is an ongoing effort to apply the Transformer model to new domains such as image and video. The code and the synthetic data set will be made available to the public to encourage others to experiment and further refine the task.
The model still needs tuning: the experiments that were carried out only considered a narrow subset of hyperparameters, mainly related to input resolution and compression, but the model has many more. Moreover, a smaller version of the model was tested rather than the standard version.
In order to generate more samples for the data set, more Latin text needs to be collected, potentially including religious text and text from other time periods and authors.
Future iterations of the crowdsourcing platform will take synthetic data generation into account, allowing for more realistic and diverse examples.