A circular economy with zero pollution is not just a dream anymore. We are proud to say that Pi School is working to make it real. We presented the project during the AIforPeople conference 2021 and are happy to share more details in this article.
In collaboration with the Directorate Growth and Innovation of the Joint Research Centre (JRC) of the European Commission, Pi School worked on Patents4IPPC (Patents for Industrial Pollution Prevention and Control). The project involved the following people: Francesco Cariaggi, Deep Learning Engineer; Cristiano De Nobili, Deep Learning Research Scientist; and Sébastien Bratières, Managing Director at Pi School.
Pi School’s objective in this project was to map environmental capabilities across the EU and the world, analyze them using tools of Economic Complexity, and check whether they match the need for clean technology specified by the European Green Deal. One way to achieve this was to retrieve geo-localized documents describing R&D activities, particularly patents, from across the globe.
The European Green Deal aims to achieve a circular economy with zero pollution by 2050. To this end, the Directorate Growth and Innovation of the Joint Research Centre of the European Commission regularly compiles Best Available Techniques (BAT) reference documents, also known as BREFs, which give a clear picture of the state of the art in industrial pollution prevention and control in all European Member States.
We set out to build an Information Retrieval (IR) system that recovers relevant patents from queries based on specific subsections of BREF documents, leveraging recent advances in Natural Language Processing (NLP). In particular, we created an IR engine based on the Transformer architecture (specifically BERT), enhanced with a Siamese structure for sentence embedding (specifically SBERT) and supported by FAISS indexing. After several rounds of domain-specific, self-supervised adaptation with masked language modeling (MLM), followed by supervised fine-tuning on semantic similarity, our best model outperformed legacy approaches, with up to a 240% relative improvement in Spearman rank correlation. We trained and fine-tuned our model on several open-source datasets and assessed its effectiveness by comparing it with baseline approaches on a brand new dataset provided by the JRC.
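To make the retrieve-then-evaluate idea concrete, here is a minimal sketch in plain Python. It is not the project's code: the two-dimensional toy vectors stand in for SBERT sentence embeddings, the brute-force cosine search stands in for a FAISS index, and every name (rank_patents, spearman, the patent IDs, the relevance scores) is hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_patents(query_vec, patent_vecs):
    """Return (patent IDs sorted by decreasing similarity, score per ID).

    Brute-force search; an index such as FAISS would replace this loop
    when the collection is large.
    """
    scores = {pid: cosine_similarity(query_vec, vec)
              for pid, vec in patent_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Toy data (assumed, not from the project): one query and three patents.
query = [1.0, 0.0]
patents = {"P1": [0.9, 0.1], "P2": [0.5, 0.5], "P3": [0.0, 1.0]}
human_relevance = {"P1": 3, "P2": 2, "P3": 1}  # hypothetical expert scores

ranking, scores = rank_patents(query, patents)
ids = list(patents)
rho = spearman([scores[p] for p in ids], [human_relevance[p] for p in ids])
print(ranking, rho)
```

The Spearman coefficient compares the order the model assigns to the patents with the order given by reference relevance scores, so it rewards a system for ranking documents correctly even when the raw similarity values are on a different scale.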
Curious to see how it works? Here is the link to the GitHub repository.
We are happy to report that the JRC appreciated our work and decided to support further developments. Stay tuned!