
Problem Statement

Previously, we made two great datasets, CUB-2019 and CUB-2023. However, the zero-shot accuracies of CLIP on these datasets did not match its performance on the original CUB. One possible reason is that CLIP may have overfitted to the original CUB dataset, so that the zero-shot performance, or "Validation Accuracy", on our CUB-2019 and CUB-2023 is lower than the "Training Accuracy" on CUB. To validate this hypothesis, I decided to examine CLIP's training set in more detail.

Approach

The LAION-400M Dataset

CLIP's training dataset has not been publicly released, but it is known to be derived from publicly available sources across the Internet. Open-CLIP, an open-source implementation of CLIP, is trained on datasets like LAION-400M and LAION-2B, both of which originate from Common Crawl, the largest open web-crawl data source. I selected the model trained on LAION-400M, as LAION-400M is similar in size to CLIP's training set, WebImageText, and it shows similar performance to CLIP on the three CUB datasets:

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy | 56.35 | 45.06 | 44.08
Top-5 Accuracy | 86.31 | 76.82 | 75.19
Open-CLIP's performance on the three datasets

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy | 51.86 | 41.51 | 40.89
Top-5 Accuracy | 82.86 | 73.74 | 72.43
CLIP's performance on the three datasets
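
For reference, here is a minimal sketch of loading a LAION-400M-pretrained Open-CLIP model and building the zero-shot text prototypes. The model and pretrained tags ("ViT-B-32", "laion400m_e32") are my assumptions; check open_clip.list_pretrained() for the tags available in your install.

```python
# Minimal sketch: load an Open-CLIP model pretrained on LAION-400M and encode
# the CUB class names as zero-shot text prototypes.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"  # tags are assumptions; see list_pretrained()
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["Black footed Albatross", "Laysan Albatross"]  # ... all 200 CUB classes
prompts = [f"a photo of a {name}." for name in class_names]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features /= text_features.norm(dim=-1, keepdim=True)
```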

Finding the Subsets

The dataset consists of 413 million text-image pairs, so tracing 12,000 images feels like searching for needles in the ocean. To speed up the process, I chose to filter with specific keywords. Luckily, the dataset's metadata provides a sample ID, URL, and text description for each image, which allows me to query images by caption or by domain.
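
A minimal sketch of such a query over one metadata shard, assuming the parquet columns SAMPLE_ID, URL, and TEXT of the released LAION-400M metadata; the file name is a placeholder.

```python
# Minimal sketch: filter one LAION-400M metadata parquet file by caption keyword
# or by URL domain. Column names (SAMPLE_ID, URL, TEXT) are assumptions based on
# the released metadata files; adjust them to match your local copy.
import pandas as pd

meta = pd.read_parquet("laion400m_metadata_part0.parquet",  # placeholder path
                       columns=["SAMPLE_ID", "URL", "TEXT"])

# Query by caption keyword (e.g. a CUB species name).
by_caption = meta[meta["TEXT"].str.contains("Indigo Bunting", case=False, na=False)]

# Query by source domain (e.g. images hosted on inaturalist.org).
by_domain = meta[meta["URL"].str.contains("www.inaturalist.org", na=False)]

print(len(by_caption), len(by_domain))
```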

The iNaturalist Subset

The CUB-2019 and CUB-2023 datasets are derived from inaturalist.org. However, a query on "www.inaturalist.org" returns only 540 results, most of which are non-bird data (insects, fish, plants, etc.). Therefore, the CUB-2019 and CUB-2023 images are clearly not part of the LAION-400M training set.

The CUB classes subset

The CUB dataset does not provide image URLs, and popular reverse image search tools (TinEye and Google reverse image search) did not give satisfactory results in finding the image sources. Therefore, I changed my strategy: I created subsets of all possible images corresponding to the 200 bird species listed in the CUB dataset.

I then made 200 queries for images whose captions contain any of the 200 class names in the CUB dataset. All the matching results are stored in an Excel file, with one sheet for each species.
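
A minimal sketch of this step, reusing the `meta` DataFrame from the previous sketch; the class list shown is truncated and the output filename is a placeholder.

```python
# Minimal sketch: one caption query per CUB species, one Excel sheet per species.
# Assumes `meta` is the metadata DataFrame loaded in the previous sketch and
# `cub_classes` holds the 200 CUB species names.
import pandas as pd

cub_classes = ["Indigo Bunting", "Purple Finch"]  # ... all 200 species names

with pd.ExcelWriter("laion_cub_candidates.xlsx") as writer:
    for name in cub_classes:
        hits = meta[meta["TEXT"].str.contains(name, case=False, na=False)]
        # Excel sheet names are limited to 31 characters.
        hits.to_excel(writer, sheet_name=name[:31], index=False)
```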

Although the paired text mentions a bird species, some of these images actually show non-bird objects (phone cases, tote bags, hoodies, or postcards). I performed a preliminary filtering pass to remove such images.
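
The post does not describe how this filtering was done; below is one plausible sketch using a caption keyword blacklist, which is an illustrative assumption rather than the actual method.

```python
# One plausible way to drop obvious merchandise hits: a caption keyword blacklist.
# This is an illustrative assumption, not necessarily how the filtering was done.
MERCH_KEYWORDS = ("phone case", "tote bag", "hoodie", "postcard",
                  "t-shirt", "sticker", "mug")

def looks_like_merchandise(caption: str) -> bool:
    caption = caption.lower()
    return any(keyword in caption for keyword in MERCH_KEYWORDS)

# `hits` is a per-species result DataFrame from the previous sketch.
hits = hits[~hits["TEXT"].fillna("").apply(looks_like_merchandise)]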

Image Hash

Perceptual hashing produces hash values that are robust to minor modifications of an image, such as cropping or slight changes in brightness and contrast. This makes it very effective for identifying duplicate or near-duplicate images.

Hashing images from the CUB dataset is straightforward since all the images are available locally. I simply compute the hash value for each image and save the information in an Excel file. This file also contains 200 sheets, one per species, making it easier to compare images of the same species.
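
A minimal sketch of this step, assuming the standard CUB-200-2011 directory layout (one folder per species under images/) and the imagehash library; the output filename is a placeholder.

```python
# Minimal sketch: perceptual-hash every CUB image, one Excel sheet per species.
# Assumes the standard CUB layout with one folder per species under images/.
from pathlib import Path

import imagehash
import pandas as pd
from PIL import Image

cub_root = Path("CUB_200_2011/images")

with pd.ExcelWriter("cub_hashes.xlsx") as writer:
    for species_dir in sorted(cub_root.iterdir()):
        rows = []
        for img_path in sorted(species_dir.glob("*.jpg")):
            phash = imagehash.phash(Image.open(img_path))
            rows.append({"file_name": img_path.name,
                         "file_path": str(img_path),
                         "hash_value": str(phash)})
        pd.DataFrame(rows).to_excel(writer, sheet_name=species_dir.name[:31],
                                    index=False)
```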

The LAION-400M official website provides instructions for accessing the images using img2dataset, so I could download the images in batches and then compute their perceptual hashes. The hash values are saved back to the spreadsheet.
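
Below is a sketch of such a batched download using img2dataset's Python entry point. The input file is a hypothetical parquet export of the matched URL/TEXT pairs, and the parameter names follow the project's README examples, so they should be checked against the installed version.

```python
# Sketch of a batched download of the matched URLs with img2dataset.
# Input file and parameter values are placeholders; verify the parameter names
# against the img2dataset documentation for your installed version.
from img2dataset import download

download(
    url_list="laion_cub_candidates.parquet",  # hypothetical export of URL/TEXT pairs
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_folder="laion_cub_images",
    output_format="files",
    image_size=512,
    processes_count=8,
    thread_count=32,
)
```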

Image Matching

Both the spreadsheets of CUB and LAION-400M are structured as follows:

File name/ File ID | File Path | Hash Value

I can measure the similarity of any two images by computing the Hamming distance between their hash values (the number of bit positions in which the two hashes differ).

Since each species has its own sheet, I can compare hash values only for images within the same class. This avoids the issue of matching images across different species.
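
A minimal sketch of this per-species matching, assuming both spreadsheets follow the layout above with columns file_name, file_path, and hash_value, and that hashes are stored as hex strings (imagehash.hex_to_hash restores them, and subtracting two hashes gives their Hamming distance). The file names are placeholders.

```python
# Minimal sketch: compare CUB and LAION-400M hashes species by species.
# Subtracting two imagehash values gives their Hamming distance.
import imagehash
import pandas as pd

THRESHOLD = 10  # maximum Hamming distance to report as a likely duplicate

cub_sheets = pd.read_excel("cub_hashes.xlsx", sheet_name=None)     # dict of DataFrames
laion_sheets = pd.read_excel("laion_hashes.xlsx", sheet_name=None)

matches = []
for species, cub_df in cub_sheets.items():
    laion_df = laion_sheets.get(species)
    if laion_df is None:
        continue
    for _, cub_row in cub_df.iterrows():
        cub_hash = imagehash.hex_to_hash(cub_row["hash_value"])
        for _, laion_row in laion_df.iterrows():
            dist = cub_hash - imagehash.hex_to_hash(laion_row["hash_value"])
            if dist <= THRESHOLD:
                matches.append((species, cub_row["file_name"],
                                laion_row["file_name"], dist))

print(f"{len(matches)} candidate pairs at distance <= {THRESHOLD}")
```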

Thresholding and results

Setting the threshold is challenging. With a threshold of 10, we identified 166 pairs; with a threshold of 12, we found 541 pairs; and there are 1,234 image pairs with a Hamming distance of less than 15. Even at low distances, some paired images look relatively different or have significantly different content.

In the following examples, the columns show the pair index and hash difference, the CUB image, and the LAION-400M image.

Concerns

The Text Captions

One concern is that images from the CUB dataset might also exist in LAION-400M but with different captions, such as "photo of a yellow bird". Querying only with species names in my initial step might therefore miss a significant number of images.

To address this, I reviewed the CUB-200 paper. The article mentions using species names as query terms on Flickr and employing Amazon Mechanical Turk for image filtering and annotation. The process did not involve expert bird identification. Therefore, I believe using the same species names for querying is reasonable.

Unavailable URLs

While using img2dataset to download relevant images from LAION-400M, some downloads failed because the dataset stores images as URLs, and some URLs are no longer available. This indicates that there are images in the training set that are no longer accessible for tracing.

Sample report from the img2dataset program, showing the failed-download count and the error log.

The Near-Duplicates

While examining the results, I observed images that look very similar but are not identical, for example, photos of the same species taken from the same angle.

Although they are not identical, their similarity raises a question: Could such photos be indicative of overtraining by CLIP? The answer is not yet known, but I plan to explore this in my future research.


Problem Statement

Over-represented datasets in model training data can lead to biases, causing models to perform well on these specific datasets but poorly on others. For instance, the CUB datasets might be over-represented in CLIP's training data.

My last week's post shared the results of CLIP zero-shot classification for CUB, CUB-2019, and CUB-2023: https://blogs.gwu.edu/pless/2024/06/10/comparative-study-of-clips-zero-shot-accuracy-on-cub-dataset/

This week, I want to find out which factors make our accuracy about 10% lower than on CUB-200-2011. Is this discrepancy due to CUB being over-represented in CLIP's training data, or are there issues with our datasets?

Experiments

Exp1: Confusion Matrices

To evaluate CLIP's performance on CUB, CUB-2019, and CUB-2023, I plotted the following confusion matrices to show the accuracy of predicting the correct classes. In these matrices, top-5 predictions are included.

  • All three matrices show a clear diagonal line, indicating that CLIP correctly predicts the majority of the images within its top-5 results.
  • Misclassifications (the non-zero values scattered off the diagonals) are dispersed throughout the matrices but cluster in certain areas. This suggests that CLIP tends to confuse certain classes, especially visually similar ones such as the different types of sparrows and warblers. The clusters are common across all three matrices.
  • The color scale on the right indicates the frequency of predictions, with brighter colors representing higher frequencies. The brightness of the three matrices is in the order CUB > CUB-2019 > CUB-2023, although to my eye CUB and CUB-2019 look similarly bright despite their different numerical results.
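
A minimal sketch of how the top-5 "confusion" matrix behind these plots can be assembled, assuming the per-image CLIP similarity scores and ground-truth labels are already computed; the count at (true, predicted) is incremented for every class in the top-5.

```python
# Minimal sketch: accumulate a top-5 confusion matrix from CLIP similarity scores.
# Assumes `logits` is an (N, 200) array of image-to-text similarities and `labels`
# is a length-N array of ground-truth class indices, both computed beforehand.
import numpy as np

def top5_confusion(logits: np.ndarray, labels: np.ndarray, num_classes: int = 200):
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    top5 = np.argsort(-logits, axis=1)[:, :5]   # five highest-scoring classes per image
    for true_class, preds in zip(labels, top5):
        matrix[true_class, preds] += 1          # count every class in the top-5
    return matrix

# matrix = top5_confusion(logits, labels)
# plt.imshow(matrix) then shows the diagonal and the off-diagonal clusters discussed above.
```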

Exp2: Minor Adjustments

My initial guess was that the original CUB, containing 5,994 training images and 5,794 testing images, is twice as large as our CUB-2019 or CUB-2023 dataset, which have only as many images as the testing set. To eliminate the effect of dataset size, I used only the testing set of CUB. It turns out the overall performance remains the same.

I came across the prompt templates for zero-shot classification in CLIP's official GitHub repository and found that the prompt for Birdsnap is 'a photo of a {class_name}, a type of bird.'. Thinking it would be helpful for bird-specific datasets, I adopted this prompt and got roughly a 1% improvement on each dataset:

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy | 53.18 | 42.29 | 41.58
Top-5 Accuracy | 84.04 | 74.61 | 73.84
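
For concreteness, here is a minimal sketch of zero-shot classification with this prompt, using the official clip package; the ViT-B/32 backbone, image path, and truncated class list are placeholders of mine, not the exact setup of the experiments.

```python
# Minimal sketch: CLIP zero-shot classification with the bird-specific prompt.
# Model choice (ViT-B/32), image path, and class list are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["Black footed Albatross", "Laysan Albatross"]  # ... all 200 CUB classes
prompts = [f"a photo of a {name}, a type of bird." for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("example_bird.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T

top5 = logits.topk(5, dim=-1).indices[0].tolist()
print([class_names[i] for i in top5])
```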

Exp3: Stats and Examples of Images

To further investigate the results shown by the confusion matrices, I decided to examine the images themselves. I produced two sets of images:

  1. Bar charts displaying the difference in per-class accuracy between two datasets;
    1. CUB vs CUB-2019;
    2. CUB vs CUB-2023;
    3. CUB-2019 vs CUB-2023;
  2. Large composite images containing all images per class for each dataset (a sketch of how these composites can be built follows below);
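
The resulting composites are shared through Google Drive; below is a minimal sketch of how such a per-class composite can be built with PIL. The tile size, grid width, and paths are illustrative assumptions.

```python
# Minimal sketch: tile all images of one class into a single composite image.
# Tile size, grid width, and paths are arbitrary choices for illustration.
from pathlib import Path
from PIL import Image

def make_composite(image_dir: str, out_path: str, tile=(224, 224), cols=10):
    paths = sorted(Path(image_dir).glob("*.jpg"))
    rows = (len(paths) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * tile[0], rows * tile[1]), "white")
    for i, p in enumerate(paths):
        thumb = Image.open(p).convert("RGB").resize(tile)
        canvas.paste(thumb, ((i % cols) * tile[0], (i // cols) * tile[1]))
    canvas.save(out_path)

# make_composite("CUB_200_2011/images/<class_folder>", "indigo_bunting_composite.jpg")
```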

There are 8 classes in the original CUB dataset that are not present in our CUB-2019 and CUB-2023 datasets due to insufficient images. Therefore, we exclude the first 8 bars from each chart.

(The charts are extremely long and become blurry when uploaded, so I decided to share the images through Google Drive.)

I have not yet done a comprehensive analysis of all the charts, but simply looking at the composite images for the categories with the largest accuracy drops in CUB-2019 or CUB-2023 provides some insights.

Indigo Bunting in CUB-200-2011
Indigo Bunting in CUB-2019
Indigo Bunting in CUB-2023

Let's look at another example:

Purple Finch in CUB-200-2011
Purple Finch in CUB-2019
Purple Finch in CUB-2023

Both the Purple Finch and the Indigo Bunting are species with very distinct appearances between male and female birds. In the original CUB dataset, the majority of the images are of male birds, whose more prominent features make them easier to classify. However, our datasets include more images of female and immature birds, resulting in nearly a 50% reduction in accuracy for these classes compared to the original CUB.
Aside from the female and immature birds, our datasets also include photos taken from different angles, in dark environments, and with distracting backgrounds, all of which increase the difficulty of classification.

Exp4: Change Class Names (inspired by Abby)

Although CLIP's officially posted prompts do not include hyphens in bird names, I personally think class names like "Black-footed Albatross" are more common in the real world and would therefore yield better results than "Black footed Albatross". So I conducted experiments by adding the hyphens back to the class names.

Text input | CUB w/ hyphen | CUB w/o hyphen
Top-1 | 53.49 | 53.19
Top-5 | 84.54 | 84.04

However, adding the hyphen results in only a slight improvement overall. A closer look at the per-class results reveals that some classes did see significant improvements. However, not all classes with a format similar to "Black-footed Albatross" improved as expected; some even experienced a decrease in accuracy. My current understanding of CLIP does not explain this result. If anyone has insights into this, please feel free to share in the comments.

Conclusion

At the end of the day, I cannot draw a definitive conclusion like "CUB is over-represented in CLIP's training data." However, I am confident that our dataset has its strengths: we feature more female birds, immature birds, and photos captured from various angles. Moving forward, I plan to conduct a comparative study between CUB-2019 and CUB-2023 and to analyze the images in a different, more quantitative manner.

The primary goal of this study is to establish baseline CLIP zero-shot results for a newly completed dataset.

The analysis includes the original CUB-200-2011 dataset, the CUB-2019 dataset, which may have been included in CLIP's training data, and the CUB-2023 dataset, which consists of images that are not part of CLIP's training data.

Results

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy | 51.86 | 41.51 | 40.89
Top-5 Accuracy | 82.86 | 73.74 | 72.43

Comparison with Published Findings

In "If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions," six fine-grained image classification datasets, including CUB-200-2011, were evaluated using CLIP to analyze how Vision-Language Models prioritize information. The zero-shot accuracy of CLIP on the CUB dataset was reported as 51.4% (Esfandiarpoor et al., 2024).

In the study "How Well Does CLIP Understand Texture?" the authors performed zero-shot learning on various texture and material classification datasets. The zero-shot classification top-k accuracy on the CUB dataset is presented in a table within the paper. Notably, the top-1 accuracy using default class names (common names) was 51.8%. A significant finding of this paper is the substantial performance drop observed when common names were replaced with scientific, genus, family, and order names (Wu and Maji, 2022).

In the paper "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts," the authors reported a zero-shot transfer accuracy of 54.70% on the CUB dataset using the ViT-B/16 backbone model (Maniparambil et al., 2023). Using the ViT-B/16 model in my experiments also resulted in about 55% accuracy.

References

Esfandiarpoor, Reza, Cristina Menghini, and Stephen H. Bach. “If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions.” arXiv, March 25, 2024. https://doi.org/10.48550/arXiv.2403.16442.

Maniparambil, Mayug, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E. O’Connor. “Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts,” 262–71, 2023. https://openaccess.thecvf.com/content/ICCV2023W/MMFM/html/Maniparambil_Enhancing_CLIP_with_GPT-4_Harnessing_Visual_Descriptions_as_Prompts_ICCVW_2023_paper.html.

Wu, Chenyun, and Subhransu Maji. “How Well Does CLIP Understand Texture?” arXiv, November 4, 2022. http://arxiv.org/abs/2203.11449.

This week I am working on reimplementing experiments in the field of fine-grained visual classification. The dataset used for this study is CUB-200-2011, a fine-grained bird classification dataset.

Summary

Method | Top-1 Accuracy (My Result) | Top-1 Accuracy (Original Result) | Code
FFVT | 91.62 | 91.6 | link
Vit-NeT | 91.6 | 91.7 | link
TransFG | NA | 91.7 | link
IELT | 91.267 | 91.8 | link
SAC | NA | 91.8 | link
HERBS | 93.01 | 93.1 | link

Details

  • FFVT
  • IELT
  • Vit-Net
  • HERBS
  • TransFG: Due to GPU limitations, the training steps were not completed. However, I successfully migrated the workflow to Google Colab and optimized the data loading steps, reducing the time from 40 minutes to 2 minutes.
  • SAC: Learned how to set up a virtual environment with TensorFlow 1.15 and Python 3.7, but encountered issues on a MacBook due to hardware limitations.