In our recent research presented at CVPR 2022, we propose Multi-label Iterated Learning (MILe), a novel method for composing multi-label descriptions of images from single-label annotations. Sai Rajeswar Mudumba is the first author of this paper. This blog post was written by Pau Rodriguez, a research scientist at ServiceNow. Marie-Ève Marchand and Sean Hughes also contributed as reviewers and editors of this article.
Bibtex:
@inproceedings{SaiRajeswarMudumba2022,
  author = {Sai Rajeswar Mudumba and Pau Rodriguez and Soumye Singhal and David Vazquez and Aaron Courville},
  title = {Multi-label Iterated Learning for Image Classification with Label Ambiguity},
  booktitle = {Computer Vision and Pattern Recognition (CVPR)},
  year = {2022}
}
Most of the datasets used to train image classification systems, such as ImageNet, consist of image-label pairs. The image shows some object in some context, and the label names the object the image is best known for. If the essence of an image is worth a thousand words, describing its content certainly requires more than a single label. For example, Figure 1 shows a picture that would typically be labeled “bicycle” in the ImageNet database. However, any of the labels “car”, “building”, “person”, “bag”, or “road” could apply to this same picture. Forcing the model to favor one object label over another has unintended consequences such as confounding bias, particularly when two labels frequently occur in the same image. In this instance, “bicycle” is favored over “person”, even though the cyclist is the main focus of the image.
Figure 1. Bicycle. Source: Unsplash.
MILe builds multi-label descriptions from single-label annotations with two key adaptations. The first removes the requirement of choosing one label over the others when making a prediction. To do so, we replaced the softmax activation function at the output of the neural network with a sigmoid activation, so that each label is scored independently. The second adaptation is inspired by the iterated learning framework for the evolution of languages introduced by Kirby et al. [1]. This framework shows that languages evolve to become compositional so that they can be transmitted effectively from generation to generation. For instance, we could use the word foo for blue balls and var for red balls. However, we would then need to learn a new name for each object-property combination, and we would forget the names of objects that we do not use daily. By contrast, words that can be decomposed into objects and properties and recomposed in novel ways are reused more frequently and thus tend to be passed on to the next generations. Accordingly, over the long term, complex and overly specific words are forgotten, and languages evolve to become compositional.
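To make the first adaptation concrete, here is a minimal sketch in plain Python, not the code used in the paper. The class names, the logits, and the 0.5 threshold are made-up values for illustration. It shows that softmax scores compete for a total probability of 1, while sigmoid scores are independent, so several labels can be active for the same image.

```python
import math

def softmax(logits):
    """Normalize logits into a distribution that sums to 1 (classes compete)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Numerically stable logistic function for a single logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_labels(logits, class_names, threshold=0.5):
    """Keep every class whose independent sigmoid score reaches the threshold."""
    return [name for name, z in zip(class_names, logits) if sigmoid(z) >= threshold]

# Hypothetical classes and logits, chosen only for illustration.
class_names = ["bicycle", "person", "car", "dog"]
logits = [2.0, 1.5, 0.3, -3.0]

softmax_probs = softmax(logits)               # sums to 1, one label dominates
sigmoid_probs = [sigmoid(z) for z in logits]  # each score lies in (0, 1) on its own
labels = predict_labels(logits, class_names)  # several labels can pass the threshold
print(labels)
```

With a softmax output, raising the score of “person” necessarily lowers the score of “bicycle”. With sigmoid outputs, both can be high at once, which is what a multi-label description requires.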
Figure 2. Images labeled with MILe. ReaL: ImageNet relabeling effort by Beyer et al. 2020. Sigmoid: ResNet-50 with sigmoid output activations. MILe: multi-label iterated learning (ours).
We ran MILe on some ambiguous ImageNet samples and show the predicted labels in Figure 2. Interestingly, MILe finds many of the labels annotated by humans (ReaL), and it even finds alternatives that were not labeled in the first place (such as notebook in the first image and potato in the second image of the second row). In the second image of the top row, we can also see that our model was able to find the pickelhaube, which was ignored by the human annotators. These predictions indicate that MILe was able to compose single-label annotations into multi-label descriptions of the images.
Results on ImageNet and WebVision
We evaluated MILe in two different settings: (1) the standard ImageNet dataset, and (2) WebVision, an uncurated dataset consisting of images downloaded from the Internet by querying image search engines with the different ImageNet labels. Results are displayed in Figure 3.
Figure 3. Image classification results. ImageNet accuracy (left), transfer from WebVision to ImageNet (center), WebVision only (right)
As can be observed, models trained with MILe achieve better accuracy and F1 score than their vanilla counterparts. Since WebVision is a noisy dataset, we found the improvement attained with MILe encouraging. We hypothesize that the learning bottleneck makes MILe prioritize those labels that are cleaner and require less memorization, thus making it more robust to noise.
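For readers unfamiliar with F1 in the multi-label setting, the following sketch computes it between a predicted label set and a reference label set. This post does not specify how the F1 score is averaged, so the per-image averaging used here is an assumption, and the label sets are made up for illustration.

```python
def f1_for_sample(predicted, reference):
    """F1 between the predicted and the reference label sets of one image."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)

def mean_f1(all_predicted, all_reference):
    """Average the per-image F1 over a dataset (assumed averaging scheme)."""
    scores = [f1_for_sample(p, r) for p, r in zip(all_predicted, all_reference)]
    return sum(scores) / len(scores)

# Hypothetical reference annotation for one image.
reference = {"bicycle", "person"}

single_label_f1 = f1_for_sample({"bicycle"}, reference)
multi_label_f1 = f1_for_sample({"bicycle", "person", "car"}, reference)
print(single_label_f1, multi_label_f1)
```

In this toy case, a model limited to one label misses “person” and loses recall, while the multi-label prediction recovers both reference labels at the cost of one extra label.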
Learning like humans learn
Figure 4. IIRC benchmark (Abdelsalam et al.)
As we discussed, MILe slowly composes a multi-label representation of the images. Interestingly, this process is similar to human learning. First, we learn what a dog and a cat are, and then we can learn the different breeds of dogs, and so on. Inspired by this phenomenon, Abdelsalam et al. (2021) introduced the Incremental Implicitly-Refined Classification (IIRC) benchmark (Figure 4). In this benchmark, labels are introduced incrementally from less to more specific, and models must remember and improve their knowledge in a continuous manner. To study whether MILe can progressively incorporate labels over time, we evaluated it on the IIRC benchmark, achieving encouraging results (Figure 5).
Conclusion
We introduced multi-label iterated learning (MILe) to address the problem of label ambiguity and label noise in popular classification datasets such as ImageNet. MILe leverages iterated learning to build a rich supervisory signal from weak supervision. It relaxes the singly labeled classification problem to multi-label binary classification and alternates the training of a teacher and a student network to build a multi-label description of an image from single labels. The teacher and the student are trained for a few iterations to prevent them from overfitting the singly labeled noisy predictions. MILe improves the performance of image classifiers for the singly labeled and multi-label problems, domain generalization, semi-supervised learning, and continual learning on IIRC. Overall, we found that iterated learning improves the performance of models trained with weakly labeled data, helping them to overcome problems related to label ambiguity and noise. We hope that our research will open new avenues for iterated learning in the visual domain.
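The alternation between teacher and student described above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation from the paper: the model is a tiny per-class logistic regression in plain Python, and the function names, the step counts, the learning rate, the 0.25 threshold, the choice to start each teacher from the current student, and the toy data are all hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, x):
    """One independent sigmoid score per class for feature vector x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def train(weights, inputs, targets, steps, lr=0.5):
    """Return a copy of `weights` after `steps` passes of per-example gradient
    descent on binary cross-entropy. A small `steps` acts as the bottleneck."""
    weights = [row[:] for row in weights]  # never modify the caller's weights
    for _ in range(steps):
        for x, t in zip(inputs, targets):
            p = predict(weights, x)
            for c, row in enumerate(weights):
                for d, xd in enumerate(x):
                    row[d] -= lr * (p[c] - t[c]) * xd
    return weights

def pseudo_labels(weights, inputs, threshold):
    """Binarize the sigmoid scores into multi-label targets."""
    return [[1 if p >= threshold else 0 for p in predict(weights, x)] for x in inputs]

def iterated_learning(inputs, single_labels, generations=3,
                      teacher_steps=5, student_steps=5, threshold=0.25):
    n_classes, n_features = len(single_labels[0]), len(inputs[0])
    student = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(generations):
        # Teacher: starts from the current student (assumption) and is trained
        # briefly on the single-label ground truth.
        teacher = train(student, inputs, single_labels, teacher_steps)
        # The teacher's thresholded predictions become multi-label targets.
        targets = pseudo_labels(teacher, inputs, threshold)
        # Student: trained briefly to imitate the teacher's multi-label output.
        student = train(student, inputs, targets, student_steps)
    return student

# Toy features: [bias, car evidence, house evidence]. One-hot labels: [car, house].
inputs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
single_labels = [[1, 0], [0, 1], [0, 1], [1, 0]]

student = iterated_learning(inputs, single_labels)
multi_labels = pseudo_labels(student, inputs, threshold=0.25)
print(multi_labels)
```

Keeping both training phases short is the point of the design: neither network has time to memorize the single, possibly noisy label of each image, so the labels that transfer from one generation to the next are the ones that are easiest to learn.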
Figure 6. Multi-label Iterated Learning (MILe) builds a multi-label representation of the images from singly labeled ground-truth. In this example, a model produces multi-label binary predictions for the next generation, obtaining Car and House for an image weakly labeled with House.
Are you interested in doing research at ServiceNow?
If you are interested in exploring full-time career opportunities with the ServiceNow Research team or wish to learn more about the part-time Visiting Researcher Program (for research internships), please take a moment to fill out this form so that hiring managers can learn more about your background and we can contact you about our current openings.
Please note that ServiceNow Research internships through our Visiting Researcher Program start and run all year and are not limited to "seasonal" applications.
Follow @ServiceNowRSRCH on Twitter for our latest news and updates from the community, and to get in touch.
© 2022 ServiceNow, Inc. All rights reserved. ServiceNow, the ServiceNow logo, Now, and other ServiceNow marks are trademarks and/or registered trademarks of ServiceNow, Inc. in the United States and/or other countries. Other company names, product names, and logos may be trademarks of the respective companies with which they are associated.