We are excited to announce that our team at Pi School, composed of Francesco Cariaggi, Cristiano De Nobili and Sébastien Bratières, had the chance to work on a follow-up to the Patents4IPPC project, improving the Information Retrieval (IR) system we initially built.
The project was initially carried out in collaboration with the Directorate Growth and Innovation of the Joint Research Centre (JRC) of the European Commission to map environmental capabilities across the EU and the world to meet the standards for clean technology defined by the European Green Deal.
The current update focused on two important points: handling long patent texts (abstracts together with claims) and explaining the model's predictions.
Ever since we built our first IR system in the context of the Patents4IPPC project, we knew that retrieving patents based solely on their abstracts was not enough. While abstracts provide succinct descriptions of the inventions proposed in the patents, the most relevant pieces of information are often contained in the claims. However, the models we used in the original Patents4IPPC project (based on BERT) couldn't easily deal with long texts without requiring extremely sophisticated hardware.
For this reason, in this follow-up project, we proposed using different model architectures (Longformer, Hierarchical Transformer) specifically designed to deal with long texts, thus allowing us to process patent abstracts and claims jointly.
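To give a flavour of the hierarchical idea, the sketch below encodes the abstract and each claim as separate segments, then aggregates the segment embeddings into a single patent vector. It is a toy illustration, not our actual model: the segment encoder is a deterministic stand-in for a BERT-like model, the embedding size and the mean-pooling aggregation are arbitrary choices, and the real Hierarchical Transformer uses a transformer layer at the second level.

```python
import zlib

import numpy as np

DIM = 16  # embedding size (arbitrary for this sketch)

def encode_segment(text: str) -> np.ndarray:
    # Stand-in for a BERT-like segment encoder: a deterministic
    # pseudo-embedding seeded by a stable hash of the text.
    seed = zlib.crc32(text.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(DIM)

def encode_patent(abstract: str, claims: list[str]) -> np.ndarray:
    # Level 1: encode the abstract and each claim independently,
    # so no single sequence exceeds the encoder's length limit.
    segments = np.stack(
        [encode_segment(abstract)] + [encode_segment(c) for c in claims]
    )
    # Level 2: aggregate segment embeddings into one patent vector
    # (mean pooling here; the real model uses a transformer layer).
    return segments.mean(axis=0)

def relevance(query: str, abstract: str, claims: list[str]) -> float:
    # Cosine similarity between the query embedding and the patent embedding.
    q = encode_segment(query)
    p = encode_patent(abstract, claims)
    return float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p)))
```

Because each segment is encoded on its own, a patent with hundreds of claims never produces a sequence longer than what the segment encoder can handle; only the second-level aggregation sees all segments at once, and it operates on short sequences of embeddings rather than raw tokens.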
Given that our models could now look at both abstracts and claims to determine the relevance of a given patent with respect to a BREF (Best Available Techniques reference) document, a new issue arose regarding how humans could validate the predictions of our model. While previously it was only a matter of skimming through the few lines of text that make up a patent abstract, now we would also need to read all the claims. This might not seem like a big deal, but some patents contain hundreds of claims (for example, patent US6684189B1 has 887). As a result, knowing which parts of a patent are deemed most relevant by our IR system suddenly becomes crucial.
To this end, we employed an input attribution algorithm called Integrated Gradients that helps us understand which specific parts of the input (in our case, abstract and individual claims) contribute the most to our predictions. Note that the attribution scores computed by the algorithm for each part of the input give us an idea of how relevant those parts are with respect to the BREF document, which is the second component of our model’s input.
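Integrated Gradients itself is simple to state: the attribution for each input feature is the feature's displacement from a baseline, multiplied by the average gradient of the model output along the straight-line path from the baseline to the input. The sketch below implements this with a Riemann-sum approximation on a toy differentiable scorer (not our actual model; the function, weights and step count are illustrative) and checks the method's completeness property: attributions sum to the difference in the model's output between the input and the baseline.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Approximate IG attributions for input x against a baseline.

    IG_i(x) = (x_i - b_i) * integral over alpha in [0, 1] of
              df/dx_i evaluated at b + alpha * (x - b),
    approximated here with a midpoint Riemann sum over `steps` points.
    """
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        total += grad_f(baseline + alpha * (x - baseline))
    avg_grad = total / steps
    return (x - baseline) * avg_grad

# Toy "relevance scorer": a linear model with a squashing nonlinearity.
w = np.array([0.8, -0.3, 1.5])

def f(x):
    return np.tanh(w @ x)

def grad_f(x):
    # Analytic gradient of f (a real model would use autograd instead).
    return (1.0 - np.tanh(w @ x) ** 2) * w

x = np.array([1.0, 2.0, 0.5])
baseline = np.zeros(3)
attributions = integrated_gradients(grad_f, x, baseline)
```

In practice one would not hand-code the gradient: libraries such as Captum compute it via automatic differentiation, and the baseline for text models is typically a sequence of padding tokens rather than a zero vector. The completeness check below is what makes the per-segment scores interpretable: they partition the model's overall relevance score across the abstract and the individual claims.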
We are delighted to have contributed to this project for two reasons. The first is the satisfaction of applying AI techniques to impactful challenges, such as achieving zero pollution. The second is the opportunity to explore frontier topics, such as long-text embedders and explainability methods.