Orrick, the global law firm focused on the tech & innovation, energy & infrastructure and finance sectors, partnered with Pi School during School of AI Session 11. Orrick asked Pi School to develop a Machine Learning tool that could draw on market and proprietary data to uncover sector and company insights that inform client solutions.
Fellows Pinku Deb Nath, Menan Velayuthan and Maria Natalia Herrera, with the help of their coach Francesco Cariaggi, developed a Natural Language Processing (NLP) system that addresses this problem, built and made ready to deploy over the eight-week programme.
Orrick wanted to segment new data according to predefined categories. To expedite the process, the firm needed an algorithmic labelling system, tailored to those categories, that can determine the class to which a new client belongs.
A standard classification of economic activities makes it possible to organize entities according to their actions, production processes, or behaviour in financial markets. It promotes the consolidation of information, supporting essential decisions about the investments and regulations needed to further industrial growth.
Different organizations adopt taxonomies of varying granularity, based on various criteria, but they do not embrace a common standard. This lack of a shared standard makes consolidating information across multiple data sources a big challenge when companies try to adopt these taxonomies as part of their daily operations.
Machine learning allows machines to automatically learn from sample data to identify patterns and make predictions with minimal human intervention.
Automation helps companies perform tasks faster and more efficiently, save time on repetitive work, and redirect employees’ time to more meaningful assignments.
Machine Learning has already been presented as an alternative for industry classification based on textual descriptions, and Natural Language Processing techniques are typically part of the pipeline.
We selected a Zero-shot text classification model from the Hugging Face model hub as a baseline for this challenge. We chose it based on its number of downloads, a good indicator of popularity and trustworthiness. This kind of model is particularly useful because it can assign an appropriate label to a given sample even though it has never come across samples tagged with that label.
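A zero-shot baseline like this can be sketched in a few lines with the Hugging Face `pipeline` API. The checkpoint, the sample description and the candidate labels below are illustrative assumptions, not the exact ones used in the project:

```python
from transformers import pipeline

# Zero-shot classification scores how well each candidate label fits the
# input text, without the model ever having been trained on samples
# tagged with those labels. Checkpoint and labels are illustrative.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

description = "The company designs and operates utility-scale solar farms."
labels = ["energy & infrastructure", "tech & innovation", "finance"]

result = classifier(description, candidate_labels=labels)
# result["labels"] is sorted by score, so the first entry is the best match
print(result["labels"][0], round(result["scores"][0], 3))
```

With the default single-label setting, the scores are normalised across the candidate labels, so they can be read as a probability distribution over the taxonomy.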
As a more advanced approach, we proposed a DistilBERT model: a small, fast, cheap and lightweight model based on the BERT architecture. The results achieved by the DistilBERT model were very good: the prediction accuracy for sector codes (a 4-value taxonomy) exceeded 80%. For this challenge, we delivered both the algorithmic solution and an end-to-end infrastructure that allows our model to be deployed on a cloud platform, Microsoft Azure, and easily integrated with the sponsor’s software ecosystem. This solution can predict, store and re-learn from the data it comes across in the production environment.
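The core of this approach is DistilBERT with a sequence-classification head sized to the 4-value sector taxonomy. The sketch below shows the set-up and a single prediction; the sector names, checkpoint and sample text are illustrative assumptions, and the head would still need fine-tuning on labelled company descriptions before its predictions mean anything:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative 4-class sector taxonomy (not the project's actual labels)
SECTORS = ["tech & innovation", "energy & infrastructure", "finance", "other"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(SECTORS)
)

text = "A venture-backed start-up building developer tools for cloud security."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 4), one score per sector

predicted = SECTORS[logits.argmax(dim=-1).item()]
print(predicted)
```

In practice the model would be fine-tuned on labelled descriptions (for example with the Transformers `Trainer` API) and the same forward pass would serve predictions behind the Azure deployment.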