
Understanding Dataset Bias in CLIP – Insights from Comparative Analysis of CUB datasets

Problem Statement

When a dataset is over-represented in a model's training data, the model can become biased toward it, performing well on that dataset but poorly on others. For instance, the original CUB dataset might be over-represented in CLIP's training data.

Last week's post shared the results of CLIP zero-shot classification on CUB, CUB-2019, and CUB-2023: https://blogs.gwu.edu/pless/2024/06/10/comparative-study-of-clips-zero-shot-accuracy-on-cub-dataset/

This week, I want to identify the factors that make our accuracy on CUB-2019 and CUB-2023 roughly 10% lower than on CUB-200-2011. Is this discrepancy due to CUB being over-represented in CLIP's training data, or are there issues with our datasets?

Experiments

Exp1: Confusion Matrices

To evaluate CLIP's performance on CUB, CUB-2019, and CUB-2023, I plotted confusion matrices of true versus predicted classes, counting every class that appears in the top-5 predictions. A sketch of how these matrices are computed follows the observations below.

  • All three matrices show a clear diagonal line, indicating that CLIP places the correct class among its top-5 predictions for the majority of the images.
  • Misclassifications (the non-zero values scattered off the diagonal) are dispersed throughout the matrices but cluster in certain areas. This suggests that CLIP tends to confuse certain classes, especially visually similar ones such as different types of sparrows and warblers. These clusters are common to all three matrices.
  • The color scale on the right indicates prediction frequency, with brighter colors representing higher counts. The overall brightness of the three matrices is ordered CUB > CUB-2019 > CUB-2023, although to me CUB and CUB-2019 look similarly bright despite the gap in their numerical results.
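Here is a minimal sketch of how such a top-5 confusion matrix can be built with the openai/CLIP package. The model variant, the prompt template, and the names class_names and dataloader (the 200 CUB class names and a DataLoader yielding preprocessed images with integer labels) are illustrative assumptions, not necessarily exactly what I ran.

    import matplotlib.pyplot as plt
    import numpy as np
    import torch
    import clip  # openai/CLIP: pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)  # model choice is an assumption

    # class_names: the 200 CUB class names (assumed); swap in the BirdSnap-style prompt from Exp2 if desired.
    template = "a photo of a {}."
    with torch.no_grad():
        text_features = model.encode_text(clip.tokenize([template.format(n) for n in class_names]).to(device))
        text_features /= text_features.norm(dim=-1, keepdim=True)

    num_classes = len(class_names)
    conf_top5 = np.zeros((num_classes, num_classes), dtype=np.int64)

    with torch.no_grad():
        for images, labels in dataloader:  # hypothetical DataLoader over one CUB split
            image_features = model.encode_image(images.to(device))
            image_features /= image_features.norm(dim=-1, keepdim=True)
            top5 = (image_features @ text_features.T).topk(5, dim=-1).indices.cpu().numpy()
            for true_label, preds in zip(labels.numpy(), top5):
                conf_top5[true_label, preds] += 1  # count every class appearing in the top-5

    plt.imshow(conf_top5, cmap="viridis")
    plt.colorbar(label="times predicted in top-5")
    plt.xlabel("predicted class"); plt.ylabel("true class")
    plt.show()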

Exp2: Minor Adjustments

My initial guess was that the original CUB dataset, containing 5,994 training images and 5,794 testing images, is roughly twice as large as our CUB-2019 or CUB-2023 dataset; our datasets have only about as many images as the original testing set. To eliminate the effect of dataset size, I used only the testing set of CUB. It turns out the overall performance remains the same.
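For completeness, selecting only the CUB test images can be done with the split file that ships with CUB-200-2011; the dataset path below is a placeholder.

    from pathlib import Path

    root = Path("CUB_200_2011")  # adjust to wherever the dataset is extracted
    id_to_path = dict(line.split() for line in (root / "images.txt").read_text().splitlines())
    id_to_split = dict(line.split() for line in (root / "train_test_split.txt").read_text().splitlines())

    # train_test_split.txt marks training images with 1 and test images with 0
    test_images = [root / "images" / p for i, p in id_to_path.items() if id_to_split[i] == "0"]
    print(len(test_images))  # expect 5794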

I came across the prompt templates for zero-shot classification in CLIP's official GitHub repository and found that the prompt for BirdSnap is 'a photo of a {class_name}, a type of bird.'. Thinking this would help bird-specific datasets, I adopted the prompt and got about a 1% improvement on each dataset.
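For reference, here is a sketch of the prompt change and the top-k accuracy computation. The names sims (the num_images x num_classes image-text similarity matrix), labels (ground-truth class indices), and class_names are placeholders standing in for the quantities computed as in the Exp1 sketch.

    import torch

    def topk_accuracy(sims: torch.Tensor, labels: torch.Tensor, k: int) -> float:
        """Fraction of images whose true class is among the k highest-scoring prompts."""
        topk = sims.topk(k, dim=-1).indices                # (num_images, k)
        hits = (topk == labels.unsqueeze(-1)).any(dim=-1)  # (num_images,)
        return hits.float().mean().item()

    # The only change in this experiment is the template used to build the text prompts:
    plain_prompts    = [f"a photo of a {name}." for name in class_names]
    birdsnap_prompts = [f"a photo of a {name}, a type of bird." for name in class_names]

    # e.g. top1 = topk_accuracy(sims, labels, k=1); top5 = topk_accuracy(sims, labels, k=5)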

Dataset              CUB      CUB-2019   CUB-2023
Top-1 Accuracy (%)   53.18    42.29      41.58
Top-5 Accuracy (%)   84.04    74.61      73.84

Exp3: Stats and Example Images

To further investigate the results shown by the confusion matrices, I decided to examine the images themselves. I produced two sets of images:

  1. Bar charts displaying the difference in per-class accuracy between two datasets (a sketch of how these are produced appears below):
    1. CUB vs CUB-2019;
    2. CUB vs CUB-2023;
    3. CUB-2019 vs CUB-2023;
  2. Large composite images containing all images per class for each dataset.

There are 8 classes in the original CUB dataset that are not present in our CUB-2019 and CUB-2023 datasets due to insufficient images, so the first 8 bars are excluded from each chart.

(The charts are extremely long and become blurry when uploaded, so I decided to share them through Google Drive.)
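Here is a rough sketch of how one of the difference bar charts can be produced. The names acc_cub and acc_2019 are placeholders for dictionaries mapping each class name to its per-class top-1 accuracy on CUB and CUB-2019.

    import matplotlib.pyplot as plt

    shared = sorted(set(acc_cub) & set(acc_2019))        # the 8 missing classes drop out here
    diffs = {c: acc_cub[c] - acc_2019[c] for c in shared}
    ordered = sorted(shared, key=lambda c: diffs[c], reverse=True)  # biggest drops first

    plt.figure(figsize=(8, 40))
    plt.barh(ordered, [diffs[c] for c in ordered])
    plt.xlabel("Top-1 accuracy difference (CUB minus CUB-2019)")
    plt.tight_layout()
    plt.savefig("cub_vs_cub2019_diff.png", dpi=200)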

I have not done a comprehensive analysis of all the charts yet, but simply looking at the corresponding composite images for the top categories that score lower in CUB-2019 or CUB-2023 already provides some insights.

Indigo Bunting in CUB-200-2011
Indigo Bunting in CUB-2019
Indigo Bunting in CUB-2023

Let's look at another example:

Purple Finch in CUB-200-2011
Purple Finch in CUB-2019
Purple Finch in CUB-2023

Both the Purple Finch and the Indigo Bunting are species whose male and female birds look very different. In the original CUB dataset, the majority of the images are of male birds, whose more prominent features make them easier to classify. Our dataset, however, includes more images of female and immature birds, resulting in nearly a 50% drop in accuracy for these classes compared to the original CUB.
Aside from the female and immature birds, our dataset also includes photos taken from different angles, in dark environments, and with distracting backgrounds, all of which increase the difficulty of classification.

Exp4: Change Class Names (inspired by Abby)

Although CLIP's officially posted prompts do not include hyphens in bird names, I personally think class names like "Black-footed Albatross" are more common in the real world and would therefore yield better results than "Black footed Albatross". So I conducted experiments by adding the hyphens back to the class names, as sketched below.
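Roughly, the renaming rule looks like the sketch below: only names with more than two words get a hyphen between the first two words. This covers cases like "Black-footed Albatross" but still mis-handles exceptions such as "Cape May Warbler" (more on this in the comments); class_names is assumed as before.

    def hyphenate(name: str) -> str:
        """Join the first two words with a hyphen for names longer than two words."""
        words = name.split()
        if len(words) > 2:
            return f"{words[0]}-{words[1]} " + " ".join(words[2:])
        return name

    # "Black footed Albatross" -> "Black-footed Albatross"
    # "Loggerhead Shrike"      -> "Loggerhead Shrike" (unchanged)
    hyphenated_names = [hyphenate(n) for n in class_names]
    prompts = [f"a photo of a {name}, a type of bird." for name in hyphenated_names]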

Text input           CUB w/ hyphen   CUB w/o hyphen
Top-1 Accuracy (%)   53.49           53.19
Top-5 Accuracy (%)   84.54           84.04

However, adding the hyphens yields only a slight improvement overall. A closer look at the per-class results reveals that some classes did see significant improvements, yet not every class with a format similar to "Black-footed Albatross" improves as expected; some even experienced a decrease in accuracy. My current understanding of CLIP does not explain this result. If anyone has insights into this, please feel free to share in the comments.

Conclusion

At the end of the day, I cannot draw a definitive conclusion like "CUB is over-represented in CLIP's training data." However, I am confident that our dataset has its strengths: we feature more female birds, immature birds, and photos captured from various angles. Moving forward, I plan to conduct a comparative study between CUB-2019 and CUB-2023 and to analyze the images in a different, more quantitative manner.

6 thoughts on “Understanding Dataset Bias in CLIP – Insights from Comparative Analysis of CUB datasets”

  1. yu.wu1

    fantastic and interesting work!👏

    I have a quick thought on the hyphen stuff.

    Do you add a hyphen between every two consecutive words?

    For example, the last species on your last graph is Loggerhead Shrike. I guess we shouldn't add a hyphen to make it: "Loggerhead-Shrike". And if we don't add a hyphen to Loggerhead Shrike, the accuracy should not change.

    Determining when to add a hyphen is hard, but I guess it could be done by prompting ChatGPT with the 200 species names and asking it to respond in JSON format.

    1. Le Sun

      Hi Yu, thank you for the suggestion!

      The figure is a bit misleading, and I should've shown the names with hyphens.

      Basically, I only add a hyphen for names with more than two words, so "Loggerhead Shrike" would remain the same. But there are still many exceptions; for example, "Cape May Warbler" should not be written as "Cape-May Warbler". Using ChatGPT is a great idea; I should try it later this week.

  2. Robert

    Great post!

    I wonder if someone can find out how the CLIP tokenizer actually handles hyphens?

    Hyphens serve the purpose of "more strongly connecting" words. I wonder if the hyphen is most important when the words are relatively common. For example:

    "groove", "billed", and "ani" are all common words, unrelated to birds, so connecting them more strongly is helpful,

    while anything with "vireo" or "warbler" is for sure a bird, and then the hyphenation matters less.

    1. yu.wu1

      I think the CLIP tokenizer handles a hyphen as a separator.

      For example, "Cape-May Warbler" is tokenized as ['', 'cape', '-', 'may', 'warbler', '']
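      A quick way to check (a sketch, assuming the openai/CLIP package is installed):

        from clip.simple_tokenizer import SimpleTokenizer

        tok = SimpleTokenizer()
        ids = tok.encode("Cape-May Warbler")   # text is lowercased internally before BPE
        print([tok.decoder[i] for i in ids])   # per-token BPE strings; the hyphen is its own token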

  3. Abby

    Hi! This is all very cool and I'm really excited about some of the results.

    Is it possible that the montages are currently all showing the same figure? There shouldn't be any images that exist in both the 2023 data and the 2011 data but the montages all appear the same for me.

    An alternative visualization that might be a bit simpler for directly comparing across classes would be something like a 3x10 grid of images where each row corresponds to one of the datasets, and in each row the first 5 images are 5 (random?) images we get right and the second 5 images are 5 (random?) images we get wrong. I think random is the right choice, but it could also be your five MOST correct images and your five LEAST correct images? And maybe we want to see more than 5 each, but that's a reasonable starting point? But I think we just generally want to get a sense of whether there are kind of fundamental differences in the distributions of the images in our new dataset that could be causing the decrease in performance.

    1. Le Sun

      Hi Abby, thanks for your feedback. I’m having trouble updating the correct image on the site. I will try a different representation in my next post, incorporating your suggested approach to avoid any confusion.

