
Problem Statement

Over-represented datasets in model training data can lead to biases, causing models to perform well on these specific datasets but poorly on others. For instance, the CUB datasets might be over-represented in CLIP's training data.

Last week's post shared the results of CLIP zero-shot classification on CUB, CUB-2019, and CUB-2023: https://blogs.gwu.edu/pless/2024/06/10/comparative-study-of-clips-zero-shot-accuracy-on-cub-dataset/

This week, I want to identify the factors that make our accuracy roughly 10% lower than on the original CUB-200-2011. Is this discrepancy due to CUB being over-represented in CLIP's training data, or are there issues with our datasets?

Experiments

Exp1: Confusion Matrices

To evaluate CLIP's performance on CUB, CUB-2019, and CUB-2023, I plotted confusion matrices showing how often each class is predicted correctly. The matrices include the top-5 predictions for every image.

  • All three matrices show a clear diagonal line, indicating that CLIP correctly predicts the majority of the images within its top-5 results.
  • Misclassifications (the non-zero values scattered off the diagonals) are dispersed throughout the matrices but cluster in certain areas. This suggests that CLIP tends to confuse certain classes, especially those that are visually similar, such as different types of sparrows and warblers. The clusters are common across all three matrices.
  • The color scale on the right indicates the frequency of predictions, with brighter colors representing higher frequencies. The overall brightness of the three matrices follows the order CUB > CUB-2019 > CUB-2023, although to my eye CUB and CUB-2019 look similarly bright in this plot despite the gap in their numerical results.
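
For reference, below is a minimal sketch of how such a top-5 confusion matrix can be built and plotted. It assumes arrays of ground-truth labels and top-5 predicted class indices are already available from the zero-shot classifier; the variable names, placeholder inputs, and plotting choices are illustrative, not the exact code behind the figures above.

    import numpy as np
    import matplotlib.pyplot as plt

    def top5_confusion_matrix(true_labels, top5_preds, num_classes=200):
        """Count, for each true class, how often every class appears in the top-5."""
        cm = np.zeros((num_classes, num_classes), dtype=np.int64)
        for t, preds in zip(true_labels, top5_preds):
            for p in preds:
                cm[t, p] += 1
        return cm

    # Placeholder inputs for illustration only; in practice these come from CLIP.
    rng = np.random.default_rng(0)
    true_labels = rng.integers(0, 200, size=1000)
    top5_preds = rng.integers(0, 200, size=(1000, 5))

    cm = top5_confusion_matrix(true_labels, top5_preds)
    plt.figure(figsize=(10, 10))
    plt.imshow(cm, cmap="viridis")      # brighter colors = more frequent predictions
    plt.xlabel("Predicted class")
    plt.ylabel("True class")
    plt.colorbar()
    plt.savefig("top5_confusion_matrix.png", dpi=300)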

Exp2: Minor Adjustments

My initial guess was that the original CUB, which contains 5,994 training images and 5,794 testing images, is twice as large as our CUB-2019 or CUB-2023 dataset; our datasets have only about as many images as the CUB testing set. To eliminate the effect of dataset size, I used only the testing set of CUB. It turns out the overall performance remains the same.

I also came across the templates for zero-shot classification in CLIP's official GitHub repository and found that the prompt for Birdsnap is 'a photo of a {class_name}, a type of bird.' Thinking it would be helpful for bird-specific datasets, I adopted this prompt and got roughly a 1% improvement on each dataset.
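
Here is a minimal sketch of that zero-shot setup with the Birdsnap-style prompt, assuming the openai/CLIP package and a DataLoader named loader that yields preprocessed image batches with integer labels; class_names and the batch handling are stand-ins for the actual evaluation script. The accuracies after these adjustments are in the table below.

    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # class_names is assumed: the 200 CUB class names, e.g. "Black footed Albatross"
    prompts = [f"a photo of a {name}, a type of bird." for name in class_names]
    with torch.no_grad():
        text_features = model.encode_text(clip.tokenize(prompts).to(device))
        text_features /= text_features.norm(dim=-1, keepdim=True)

    top1 = top5 = total = 0
    with torch.no_grad():
        for images, labels in loader:   # loader is assumed: preprocessed CUB test images
            image_features = model.encode_image(images.to(device))
            image_features /= image_features.norm(dim=-1, keepdim=True)
            similarity = image_features @ text_features.T        # cosine similarity
            top5_pred = similarity.topk(5, dim=-1).indices.cpu()
            top1 += (top5_pred[:, 0] == labels).sum().item()
            top5 += (top5_pred == labels.unsqueeze(1)).any(dim=1).sum().item()
            total += labels.size(0)

    print(f"Top-1: {top1 / total:.4f}  Top-5: {top5 / total:.4f}")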

Dataset          CUB     CUB-2019   CUB-2023
Top-1 Accuracy   53.18   42.29      41.58
Top-5 Accuracy   84.04   74.61      73.84

Exp3: Stats and Example Images

To further investigate the results shown by the confusion matrices, I decided to examine the images themselves. I produced two sets of images:

  1. Bar charts displaying the per-class difference in accuracy between two datasets (see the sketch after this list):
    1. CUB vs CUB-2019;
    2. CUB vs CUB-2023;
    3. CUB-2019 vs CUB-2023;
  2. Large composite images containing all images per class for each dataset.

There are 8 classes in the original CUB dataset that are not present in our CUB-2019 and CUB-2023 datasets due to insufficient images, so we exclude the first 8 bars from each chart.
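
As a rough sketch of how the bar charts are produced, the snippet below compares per-class top-1 accuracy between two datasets. The dictionaries acc_cub and acc_2019 are assumed to map class name to accuracy, and the placeholder values exist only so the example runs; they are not real results.

    import matplotlib.pyplot as plt

    # Placeholder per-class accuracies for illustration only.
    acc_cub = {"Indigo Bunting": 0.90, "Purple Finch": 0.85, "Sooty Albatross": 0.40}
    acc_2019 = {"Indigo Bunting": 0.45, "Purple Finch": 0.50, "Sooty Albatross": 0.35}

    # Keep only the classes present in both datasets (this drops the 8 missing classes).
    shared = [c for c in acc_cub if c in acc_2019]
    diffs = sorted(((acc_cub[c] - acc_2019[c], c) for c in shared), reverse=True)

    values = [d for d, _ in diffs]
    names = [c for _, c in diffs]

    plt.figure(figsize=(8, max(2, len(names) * 0.25)))
    plt.barh(names, values)
    plt.xlabel("Top-1 accuracy difference (CUB minus CUB-2019)")
    plt.tight_layout()
    plt.savefig("cub_vs_cub2019_per_class_diff.png", dpi=200)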

(The charts are extremely long and become blurry when uploaded, so I decided to share the images through Google Drive.)

I have not done a comprehensive analysis of all the charts, but simply looking at the composite images for the classes that score much lower in CUB-2019 or CUB-2023 provides some insights.

Indigo Bunting in CUB-200-2011
Indigo Bunting in CUB-2019
Indigo Bunting in CUB-2023

Let's look at another example:

Purple Finch in CUB-200-2011
Purple Finch in CUB-2019
Purple Finch in CUB-2023

Both Purple Finch and Indigo Bunting are bird species with very distinct appearances between male and female birds. In the original CUB dataset, the majority of the images are of male birds, which have more prominent features that make them easier to classify. However, our dataset includes more images of female and immature birds, resulting in nearly a 50% reduction in accuracy for this class compared to the original CUB.
Aside from the female and immature birds, our dataset also includes photos taken from different angles, in dark environments, and with distracting backgrounds, all of which increase the difficulty of classification.

Exp4: Change Class Names (inspired by Abby)

Although CLIP's officially posted prompts do not include hyphens in bird names, I personally think class names like "Black-footed Albatross" are more common in the real world and would therefore yield better results than "Black footed Albatross". So I conducted experiments by adding the hyphens back to the class names.

Text input   CUB w/ hyphen   CUB w/o hyphen
Top-1        53.49           53.19
Top-5        84.54           84.04
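
One simple way to re-insert the hyphens is to rebuild the names from the CUB folder names (e.g. "001.Black_footed_Albatross"), treating a lowercase word as the second half of a hyphenated term. This is a sketch of that heuristic, not necessarily how every class name was handled:

    def add_hyphens(folder_name: str) -> str:
        """Turn '001.Black_footed_Albatross' into 'Black-footed Albatross'."""
        words = folder_name.split(".", 1)[-1].split("_")
        out = [words[0]]
        for w in words[1:]:
            if w[0].islower():          # lowercase word -> originally hyphenated
                out[-1] = out[-1] + "-" + w
            else:
                out.append(w)
        return " ".join(out)

    print(add_hyphens("001.Black_footed_Albatross"))   # Black-footed Albatross
    print(add_hyphens("014.Indigo_Bunting"))           # Indigo Bunting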

However, adding the hyphens results in only a slight improvement overall. A closer look at the per-class results reveals that some classes did see significant improvements. However, not all classes with a format similar to "Black-footed Albatross" improve as expected; some even experienced a decrease in accuracy. My current understanding of CLIP does not explain this result. If anyone has insights into this, please feel free to share in the comments.

Conclusion

At the end of the day, I cannot draw a definitive conclusion like "CUB is over-represented in CLIP's training data." However, I am confident that our dataset has its strengths: we feature more female birds, immature birds, and photos captured from various angles. Moving forward, I plan to conduct a comparative study between CUB-2019 and CUB-2023 and to analyze the images in a different, more quantitative manner.

The primary goal of this study is to establish baseline CLIP zero-shot results for a newly completed dataset.

The analysis includes the original CUB-200-2011 dataset, the CUB-2019 dataset, which may have been included in CLIP's training data, and the CUB-2023 dataset, which consists of images that are not part of CLIP's training data.

Results

Dataset          CUB     CUB-2019   CUB-2023
Top-1 Accuracy   51.86   41.51      40.89
Top-5 Accuracy   82.86   73.74      72.43

Comparison with Published Findings

In "If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions," six fine-grained image classification datasets, including CUB-200-2011, were evaluated using CLIP to analyze how Vision-Language Models prioritize information. The zero-shot accuracy of CLIP on the CUB dataset was reported as 51.4% (Esfandiarpoor et al., 2024).

In the study "How Well Does CLIP Understand Texture?", the authors performed zero-shot learning on various texture and material classification datasets. The zero-shot top-k accuracy on the CUB dataset is presented in a table within the paper; notably, the top-1 accuracy using default class names (common names) was 51.8%. A significant finding of this paper is the substantial performance drop observed when common names were replaced with scientific, genus, family, and order names (Wu and Maji, 2022).

In the paper "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts," the authors reported a zero-shot transfer accuracy of 54.70% on the CUB dataset using the ViT-B/16 backbone (Maniparambil et al., 2023). Using ViT-B/16 in my experiments also resulted in about 55% accuracy.

References

Esfandiarpoor, Reza, Cristina Menghini, and Stephen H. Bach. “If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions.” arXiv, March 25, 2024. https://doi.org/10.48550/arXiv.2403.16442.

Maniparambil, Mayug, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E. O’Connor. “Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts,” 262–71, 2023. https://openaccess.thecvf.com/content/ICCV2023W/MMFM/html/Maniparambil_Enhancing_CLIP_with_GPT-4_Harnessing_Visual_Descriptions_as_Prompts_ICCVW_2023_paper.html.

Wu, Chenyun, and Subhransu Maji. “How Well Does CLIP Understand Texture?” arXiv, November 4, 2022. http://arxiv.org/abs/2203.11449.

This week I am working on reimplementing experiments in the field of fine-grained visual classification. The dataset used for this study is CUB-200-2011, a fine-grained bird classification dataset.

Summary

Method     Top-1 Accuracy (My Result)   Top-1 Accuracy (Original Result)   Code
FFVT       91.62                        91.6                               link
Vit-NeT    91.6                         91.7                               link
TransFG    NA                           91.7                               link
IELT       91.267                       91.8                               link
SAC        NA                           91.8                               link
HERBS      93.01                        93.1                               link

Details

  • FFVT
  • IELT
  • Vit-Net
  • HERBS
  • TransFG: Due to GPU limitations, the training steps were not completed. However, I successfully migrated the workflow to Google Colab and optimized the data loading steps, reducing the time from 40 minutes to 2 minutes.
  • SAC: Learned how to set up a virtual environment with TensorFlow 1.15 and Python 3.7, but encountered issues on a MacBook due to hardware limitations.