Case studies

Computer Vision: diving into a Hierarchical Vision Transformer

Pi School’s focus is on state-of-the-art solutions in artificial intelligence; in this post, we explain how we answered a sponsor’s request using Computer Vision.
During Pi School of AI session 11, fellows Krzysztof Wos, Vijayasri Iyer and Dennis Rotondi, coached by Cristiano De Nobili, designed an end-to-end deep-learning solution for plant disease classification.

Let’s dive into the task in this blog post.


According to FAO data, the global economy loses 220 billion USD annually to plant diseases. Secondary effects of this problem include:

  1. Loss of biodiversity.
  2. Food supply shortages and famine.
  3. Threats to livelihoods and trade between nations.

Due to the scarcity of available data for the problem, the fellows tested several models known to cope well with data shortage and class imbalance:

  1. Fine-tuned ResNet50 (Baseline)
  2. Classical ML (SVM, XgBoost)
  3. Hierarchical Vision Transformer

The hierarchical vision transformer delivered the best and most reliable performance of the three. Standard classification methods use a flat set of class labels to determine the final category of a new data point, in this case an image. However, a flat label set quickly becomes a problem when data is scarce and imbalanced across classes. Here, the fellows grouped the labels containing very few samples into larger groups/families, effectively partitioning the manifold of the solution space into sub-manifolds. The hierarchical labels reduced the amount of information each label further down the hierarchy had to carry, making the categorical predictions more accurate.
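As a minimal sketch of such a hierarchical label mapping, consider 17 fine-grained disease classes grouped into 4 macro families. The group names and index assignments below are illustrative placeholders, not the project’s actual labels:

```python
# Hypothetical mapping of 17 fine disease classes to 4 macro groups.
# Indices and group names are placeholders for illustration only.
FINE_TO_MACRO = {
    0: 0, 1: 0, 2: 0, 3: 0, 4: 0,   # macro group 0 (e.g. fungal)
    5: 1, 6: 1, 7: 1, 8: 1,         # macro group 1 (e.g. bacterial)
    9: 2, 10: 2, 11: 2, 12: 2,      # macro group 2 (e.g. viral)
    13: 3, 14: 3, 15: 3, 16: 3,     # macro group 3 (e.g. pests/abiotic)
}

def macro_label(fine: int) -> int:
    """Map a fine-grained class index to its macro-group index."""
    return FINE_TO_MACRO[fine]

def group_members(macro: int) -> list:
    """All fine classes belonging to one macro group."""
    return [f for f, m in FINE_TO_MACRO.items() if m == macro]
```

Grouping rare classes this way means each macro group has enough samples to learn from, while the fine head only has to discriminate within one group at a time.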

Of course, this approach has a few drawbacks as well. Labels further down the hierarchy depend strongly on the labels predicted at the previous levels, so any fluctuation in the prediction of the macro-groups can also degrade the accuracy of the fine classification labels.
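This dependency can be made concrete with a small, framework-free sketch. Here fine predictions are restricted to the predicted macro group (a common scheme, though not necessarily the project’s exact implementation), so a coarse mistake excludes the correct fine label outright:

```python
# Toy setup: 4 fine classes split into 2 macro groups (illustrative only).
FINE_TO_MACRO = {0: 0, 1: 0, 2: 1, 3: 1}

def hierarchical_predict(macro_logits, fine_logits):
    """Pick the macro group first, then the best fine class inside it."""
    macro = max(range(len(macro_logits)), key=lambda i: macro_logits[i])
    members = [f for f, m in FINE_TO_MACRO.items() if m == macro]
    fine = max(members, key=lambda f: fine_logits[f])
    return macro, fine

# Correct macro prediction: fine class 3 is reachable.
assert hierarchical_predict([0.1, 0.9], [0.1, 0.3, 0.2, 0.8]) == (1, 3)
# A coarse error (macro 0 wins) rules out fine classes 2 and 3 entirely,
# even though fine class 3 has the highest fine logit.
assert hierarchical_predict([0.9, 0.1], [0.1, 0.3, 0.2, 0.8]) == (0, 1)
```

The second assertion shows the failure mode: once the macro head errs, no amount of confidence in the fine head can recover the true label.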

The fellows leveraged the power of pre-trained foundation models for vision, in particular the Vision Transformer (ViT) by Dosovitskiy et al., which builds on the Transformer architecture of Vaswani et al. They fine-tuned it with additional MLP layers and a hierarchical label mapping. The diagram below provides an overview of the model architecture.
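A minimal PyTorch sketch of this head design follows. A stand-in linear “backbone” replaces the pretrained ViT feature extractor, and the layer sizes and dual-head layout are our assumptions, not the project’s exact architecture:

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Shared features feed two MLP heads: one for the 4 macro groups,
    one for the 17 fine-grained disease classes."""
    def __init__(self, in_dim=224, feat_dim=768, n_macro=4, n_fine=17):
        super().__init__()
        # Placeholder for a frozen, pretrained ViT feature extractor.
        self.backbone = nn.Linear(in_dim, feat_dim)
        self.macro_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_macro))
        # The fine head also sees the macro logits, so coarse context
        # conditions the fine-grained prediction.
        self.fine_head = nn.Sequential(
            nn.Linear(feat_dim + n_macro, 256), nn.ReLU(),
            nn.Linear(256, n_fine))

    def forward(self, x):
        feats = self.backbone(x)
        macro_logits = self.macro_head(feats)
        fine_logits = self.fine_head(
            torch.cat([feats, macro_logits], dim=-1))
        return macro_logits, fine_logits

model = HierarchicalHead()
macro, fine = model(torch.randn(8, 224))
```

Training such a model typically sums a cross-entropy loss per level, so both heads are optimized jointly.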

They used modern tools for containerization, distributed training, visualization and reproducibility: Docker, PyTorch Lightning and Gradio. The fellows achieved an accuracy of 73% on the macro groups (4 groups) and 49% on the micro groups (17 groups). They also projected the model to improve by about 3 percentage points of accuracy for every 1,000 labeled samples added to the dataset.

The success of this project goes beyond food security and plant disease classification. The Hierarchical Vision Transformer model can be extended to other problems, such as customer segmentation, morphological galaxy classification, and histology sample classification in healthcare settings.

Do you have a challenge in Computer Vision?

We can help you! Get in touch with us.
