The primary goal of this study is to establish baseline zero-shot CLIP results for a newly completed dataset.
The analysis covers three datasets: the original CUB-200-2011 dataset; the CUB-2019 dataset, whose images may have been included in CLIP's training data; and the CUB-2023 dataset, which consists of images that are not part of CLIP's training data.
Results
| Dataset | CUB | CUB-2019 | CUB-2023 |
|---|---|---|---|
| Top-1 Accuracy (%) | 51.86 | 41.51 | 40.89 |
| Top-5 Accuracy (%) | 82.86 | 73.74 | 72.43 |
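The top-1 and top-5 numbers above come from ranking, for each image, the similarity between its CLIP image embedding and the text embedding of every class prompt. A minimal sketch of that metric in plain Python (the toy scores below are illustrative stand-ins for CLIP image-text cosine similarities):

```python
def topk_accuracy(similarities, labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes.

    similarities: one list of per-class scores per image
    labels: ground-truth class index per image
    """
    hits = 0
    for scores, label in zip(similarities, labels):
        # Indices of the k classes with the highest scores for this image.
        topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

# Toy scores for 3 images over 4 classes (in practice these would be
# cosine similarities between CLIP image and text embeddings).
sims = [
    [0.9, 0.1, 0.0, 0.0],  # true class 0 ranked first
    [0.2, 0.5, 0.4, 0.1],  # true class 2 ranked second
    [0.1, 0.2, 0.3, 0.4],  # true class 0 ranked last
]
labels = [0, 2, 0]
print(topk_accuracy(sims, labels, 1))  # 1/3 correct at top-1
print(topk_accuracy(sims, labels, 2))  # 2/3 correct at top-2
```

Top-5 accuracy is always at least top-1 accuracy, which matches the pattern in the table.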
Comparison with Published Findings
In "If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions," six fine-grained image classification datasets, including CUB-200-2011, were evaluated using CLIP to analyze how Vision-Language Models prioritize information. The zero-shot accuracy of CLIP on the CUB dataset was reported as 51.4% (Esfandiarpoor et al., 2024).
In the study "How Well Does CLIP Understand Texture?" the authors performed zero-shot classification on various texture and material datasets and report top-k zero-shot accuracy on the CUB dataset. Notably, top-1 accuracy using the default class names (common names) was 51.8%. A significant finding of this paper is the substantial performance drop when common names were replaced with scientific, genus, family, or order names (Wu and Maji, 2022).
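The name-sensitivity effect reported above comes down to which vocabulary is inserted into the prompt template fed to CLIP's text encoder. A minimal sketch of that prompt construction, using the common "a photo of a ..." template; the two class lists are illustrative examples, not the full CUB vocabulary:

```python
def build_prompts(class_names, template="a photo of a {}, a type of bird."):
    """Turn a list of class names into text prompts for CLIP's text encoder."""
    return [template.format(name) for name in class_names]

# Illustrative CUB classes: the same species under two vocabularies,
# mirroring the common-name vs. scientific-name swap studied by Wu and Maji.
common = ["Black-footed Albatross", "Laysan Albatross"]
scientific = ["Phoebastria nigripes", "Phoebastria immutabilis"]

print(build_prompts(common)[0])
# a photo of a Black-footed Albatross, a type of bird.
print(build_prompts(scientific)[0])
# a photo of a Phoebastria nigripes, a type of bird.
```

Everything else in the zero-shot pipeline stays fixed; only the prompt text changes, so the accuracy gap isolates how well CLIP's text encoder knows each vocabulary.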
In the paper "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts," the authors reported a zero-shot transfer accuracy of 54.70% on the CUB dataset using the ViT-B/16 backbone (Maniparambil et al., 2023). My own experiments with ViT-B/16 yielded a comparable accuracy of about 55%.
References
Esfandiarpoor, Reza, Cristina Menghini, and Stephen H. Bach. “If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions.” arXiv, March 25, 2024. https://doi.org/10.48550/arXiv.2403.16442.
Maniparambil, Mayug, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E. O'Connor. "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts." In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 262–71, 2023. https://openaccess.thecvf.com/content/ICCV2023W/MMFM/html/Maniparambil_Enhancing_CLIP_with_GPT-4_Harnessing_Visual_Descriptions_as_Prompts_ICCVW_2023_paper.html.
Wu, Chenyun, and Subhransu Maji. “How Well Does CLIP Understand Texture?” arXiv, November 4, 2022. http://arxiv.org/abs/2203.11449.