Translated is a leading online provider of professional translations. The company combines technology with human translation processes in order to provide customers with cast-iron quality guarantees while also cutting turnaround times and the cost of professional translations. Since it was founded, Translated has ended every financial year with a net profit, and today the company can count on a team of 279,309 professional translators and an extensive portfolio of clients which includes multinationals like Hewlett Packard, IBM and Google.
Currently, accurate language identification techniques require studio-quality audio samples lasting about 10 seconds. For strategic business purposes, Translated wants to work with real-life conditions and thus be able to use short audio clips featuring noisy speech. The company therefore decided to sponsor a grant for engineer Rimvydas Naktinis, who was supported by his mentor, Lukasz Kaiser of Google Brain, and by the Faculty Director of the Artificial Intelligence Programme, Sébastien Bratières.
Being able to identify languages from short spoken utterances can be useful in many situations such as multilingual speech recognition, translation, call centre optimisation, and automatic data labelling.
Multiple neural network models were explored in this work, as it has previously been claimed that such architectures produce satisfactory results (Bartz et al., 2017; Harutyunyan & Khachatrian, 2016; Montavon, 2009). As this family of models is able to automatically extract the relevant features, the only pre-processing step was to convert the audio from the temporal domain to the frequency domain.
The goal of spoken language identification is to assign language labels to short (usually 3-10 second) audio files containing utterances in one of the languages from a predefined set. In some applications, such as multilingual speech recognition or call centre routing, an additional performance constraint is introduced which effectively rules out models requiring extensive data pre-processing.
In most of the potential applications, there is no control over the data source, so it is important to consider the behaviour of the model when exposed to different amounts and distributions of noise, differences in microphone quality, the presence of multiple speakers, different accents, and the various genders and ages of speakers.
To train and evaluate the model, two data sets were used: VoxForge and Audio Lingua. Both data sets are available to download for research purposes and contain samples of varying quality in multiple languages. Both data sets include samples from male and female speakers.
In the main experiment, the model was trained on six languages: English, French, German, Italian, Portuguese and Spanish.
Training and validation data sets. A combination of the VoxForge and Audio Lingua data sets was used for training and validation. The training set contained 12,389 files and the validation set contained 1,387 files. The combined data set comprised about 4 hours of speech per language.
Data preparation. The audio files were converted to an image representation using a log-magnitude spectrogram with a Hann window of 23 ms width and 12 ms shift. The 5.5 kHz frequency range was divided into 128 bins. A sampling rate of 44 kHz was used throughout the experiments. The conversion was performed using the Librosa Python library.
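The pre-processing described above can be sketched in plain Python. This is an illustrative sketch, not the project's Librosa code: it assumes that the stated sampling rate means the standard 44.1 kHz, that the window and shift correspond to 1,024 and 512 samples, and that the 128 bins are the lowest 128 FFT bins (which, under those assumptions, span about 5.5 kHz). The function names are hypothetical, and a naive DFT is used only so that the example needs no third-party library.

```python
import cmath
import math

SAMPLE_RATE = 44_100  # assumption: the stated rate read as standard 44.1 kHz
N_FFT = 1024          # assumption: window of about 23 ms (1024 / 44100 s)
HOP = 512             # assumption: shift of about 12 ms (512 / 44100 s)
N_BINS = 128          # number of frequency bins kept (from the text)


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def log_magnitude_frame(frame, n_bins=N_BINS):
    """Log-magnitude spectrum of one frame, keeping the lowest n_bins bins."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(n))]
    spectrum = []
    for k in range(n_bins):
        # Naive DFT of bin k; slow, but dependency-free.
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(math.log(abs(acc) + 1e-10))  # small offset avoids log(0)
    return spectrum


def spectrogram(samples):
    """Return a list of frames (time axis), each holding N_BINS log magnitudes."""
    starts = range(0, len(samples) - N_FFT + 1, HOP)
    return [log_magnitude_frame(samples[s:s + N_FFT]) for s in starts]
```

Under these assumptions each bin is 44,100 / 1,024 ≈ 43 Hz wide, so 128 bins cover roughly 0 to 5.5 kHz. Transposing the result gives an image whose height is the number of frequency bins and whose width grows with clip length.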
Convolutional recurrent neural network. The study proposes a system which combines convolutional and recurrent neural networks. The convolutional layers can potentially capture the common patterns of frequency changes in the spectrogram, while the recurrent layer is intended to capture the sequential temporal evolution of such patterns.
The architecture of the best-performing model was as follows. The spectrograms were encoded as PNG images of variable width and a height of 128 pixels (height representing the number of frequency bins). The spectrograms were provided as an input to a 2D convolutional layer with 16 filters of 7 x 7 pixels (strides of 1 and ReLU activation functions were used in all convolutional layers), followed by a dropout layer with a 0.5 dropout rate and a max pooling layer of size 3 and strides of (2, 1). After the pooling layer, layer normalisation was performed.
This was followed by three more similar convolutional layers with dropout, max pooling and layer normalisation. The parameters of these further convolutional layers are as follows: a layer of 32 5 x 5 filters, and two layers of 32 3 x 3 filters each. The pooling and dropout parameters were identical to the first convolutional layer.
The output of the last layer-normalised max pooling layer served as the input to a GRU cell with 128 hidden nodes followed by layer normalisation and dropout (rate of 0.5). Finally, this served as the input to a fully-connected layer with as many output nodes as the number of languages being considered and a softmax activation function.
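To make the layer stack easier to follow, the sketch below traces tensor shapes through the four convolutional blocks up to the GRU input. It is not the project's code. It assumes 'same' padding everywhere, reads the pooling parameters as a window of 3 with strides of (2, 1) along (frequency, time), and flattens frequency and channels into one feature vector per time step. None of these choices is confirmed by the text, and all names are hypothetical.

```python
import math

# (number of filters, kernel size) for each convolutional block, from the text
CONV_BLOCKS = [(16, 7), (32, 5), (32, 3), (32, 3)]
POOL_STRIDES = (2, 1)  # assumption: strides along (frequency, time)
GRU_UNITS = 128        # from the text


def same_out(size, stride):
    """Output length along one axis when 'same' padding is used (assumed)."""
    return math.ceil(size / stride)


def trace_shapes(height=128, width=256):
    """Return the (height, width, channels) shape after each block."""
    shape = (height, width, 1)
    shapes = [shape]
    for filters, _kernel in CONV_BLOCKS:
        h, w, _ = shape
        # A stride-1 convolution with 'same' padding keeps h and w whatever
        # the kernel size; dropout and layer normalisation keep the shape too.
        shape = (same_out(h, POOL_STRIDES[0]),
                 same_out(w, POOL_STRIDES[1]),
                 filters)
        shapes.append(shape)
    return shapes


def gru_input(shape):
    """Flatten (frequency, channels) into features: (time steps, features)."""
    h, w, c = shape
    return w, h * c
```

With these assumptions, a spectrogram 128 pixels high keeps its full width (time resolution) while the frequency axis shrinks to 8 rows, so the GRU would see 8 × 32 = 256 features per time step.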
A cross-entropy loss function with regularisation (lambda of 0.001) was optimised using the Adam optimiser. The learning rate was set to 0.001 for 300 epochs and lowered to 0.0001 for 60 further epochs.
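The training objective and learning-rate schedule can be illustrated as follows. The text does not say which kind of regularisation was used, so this sketch assumes an L2 penalty on the weights. The function names are hypothetical, and the code is plain Python rather than whatever framework the project used.

```python
import math

LAMBDA = 0.001  # regularisation strength from the text


def learning_rate(epoch):
    """0.001 for the first 300 epochs, then 0.0001 for 60 further epochs."""
    if not 0 <= epoch < 360:
        raise ValueError("the schedule covers epochs 0 to 359")
    return 1e-3 if epoch < 300 else 1e-4


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularised_loss(logits, target, weights, lam=LAMBDA):
    """Cross-entropy for one sample plus an L2 penalty (assumed penalty type)."""
    cross_entropy = -math.log(softmax(logits)[target])
    penalty = lam * sum(w * w for w in weights)
    return cross_entropy + penalty
```

For six languages, a model that outputs equal logits for every class has a cross-entropy of ln 6, which is a useful baseline when reading training curves.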
The best model achieved an accuracy of 86.81% in identifying one of six languages on a 10% validation set containing only recordings of speakers not heard during training. This data set contained 1,387 samples ranging from 1 to 22 seconds in length, so the reported accuracy is an average over samples of different durations.
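A validation set with no speakers in common with the training set can be built by splitting on speaker identity rather than on individual files. The sketch below is one hypothetical way to do this. The (speaker, file) pairs, the function name and the fixed seed are illustrative assumptions, not details from the project.

```python
import random
from collections import defaultdict


def split_by_speaker(samples, val_fraction=0.1, seed=0):
    """Split (speaker_id, file_name) pairs into train and validation lists
    so that no speaker appears in both."""
    by_speaker = defaultdict(list)
    for speaker, name in samples:
        by_speaker[speaker].append(name)

    speakers = sorted(by_speaker)            # sort first for reproducibility
    random.Random(seed).shuffle(speakers)

    total = sum(len(names) for names in by_speaker.values())
    target = val_fraction * total
    train, val = [], []
    for speaker in speakers:
        # Whole speakers go to validation until the target size is reached.
        bucket = val if len(val) < target else train
        bucket.extend((speaker, name) for name in by_speaker[speaker])
    return train, val
```

Because whole speakers are moved at once, the validation share can overshoot the target when a single speaker contributes many files.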
Additionally, three separate data sets containing samples 3, 5 and 10 seconds long respectively were extracted from the validation set. The accuracy of the model was measured on each fixed-duration sample set. The results are provided in the table below.
| Duration | 3 seconds | 5 seconds | 10 seconds |
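How the fixed-duration subsets were cut is not described, so the following sketch simply assumes that each clip at least as long as the target duration is truncated to its first few seconds, and that accuracy is then computed per subset. The data layout, the `predict` callable and the function names are hypothetical.

```python
def fixed_duration_subset(clips, seconds, sample_rate=44_100):
    """clips: list of (samples, label). Keep clips that are long enough and
    crop each one to its first `seconds` seconds (assumed cropping rule)."""
    n = int(seconds * sample_rate)
    return [(samples[:n], label) for samples, label in clips
            if len(samples) >= n]


def accuracy(clips, predict):
    """Fraction of clips whose predicted label matches the true label."""
    if not clips:
        return float("nan")
    correct = sum(1 for samples, label in clips if predict(samples) == label)
    return correct / len(clips)


def accuracy_by_duration(clips, predict, durations=(3, 5, 10),
                         sample_rate=44_100):
    """Accuracy on each fixed-duration subset, keyed by duration in seconds."""
    return {
        d: accuracy(fixed_duration_subset(clips, d, sample_rate), predict)
        for d in durations
    }
```

Note that under this rule the longer-duration subsets contain fewer clips, so their accuracy estimates are noisier.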
Finally, the same neural network architecture was also trained and tested on a data set of 3 languages – English, German and French (VoxForge only) – to compare it with the model proposed by Montavon (2009), which was trained on a data set of the same size and containing only samples of the same languages. The 3-language model achieved an accuracy of 88.94%, compared with 80.1% for Montavon (2009).
The project demonstrated a model of spoken language identification from short spoken utterances for 3 and 6 languages. The model achieved accuracy levels of over 88% and 86% respectively on the validation sets.
By investigating multiple data sets (including VoxForge, which is also used in some other studies), the research was also able to draw attention to the shortcomings of these data sets and the importance of controlling various aspects of the samples.
For any future studies of spoken language identification, the study recommends that authors report whether speaker identities overlap between their training and test data sets and, if possible, provide statistics for all known aspects of the samples in the training set (such as gender, channel type and speech rate).
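These checks are straightforward to automate. As a rough sketch, assuming each sample is described by a dictionary with hypothetical `speaker` and `gender` keys, the following reports the speaker overlap between two splits and the distribution of one metadata field:

```python
from collections import Counter


def speaker_overlap(train, test):
    """Sorted list of speaker IDs that appear in both splits."""
    train_speakers = {s["speaker"] for s in train}
    test_speakers = {s["speaker"] for s in test}
    return sorted(train_speakers & test_speakers)


def field_distribution(samples, field):
    """Share of samples per value of a metadata field; samples that lack
    the field are counted under 'unknown'."""
    counts = Counter(s.get(field, "unknown") for s in samples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}
```

An empty overlap list is what a speaker-disjoint evaluation should report; the distribution function can be reused for any other recorded attribute, such as channel type.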