Locating Images: CUB Dataset vs. CLIP’s Training Data

Problem Statement

Previously, we built two new datasets, CUB-2019 and CUB-2023. However, the zero-shot accuracies using CLIP on these datasets did not match its performance on the original CUB. One possible explanation is that CLIP overfitted on the original CUB dataset: the zero-shot performance, or "validation accuracy", on CUB-2019 and CUB-2023 is lower than the "training accuracy" on CUB. To validate this hypothesis, I decided to examine CLIP's training set in more detail.

Approach

The LAION-400M Dataset

CLIP's training dataset has not been publicly released, but it is known to be derived from publicly available sources across the Internet. Open-CLIP, an open-source implementation of CLIP, is trained on datasets such as LAION-400M and LAION-2B, both of which originate from Common Crawl, the largest open web-crawl data source. I selected the model trained on LAION-400M, since LAION-400M is similar in size to CLIP's training set, WebImageText (WIT), and it performs similarly to CLIP on the three CUB datasets:

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy (%) | 56.35 | 45.06 | 44.08
Top-5 Accuracy (%) | 86.31 | 76.82 | 75.19

Open-CLIP's performance on the three datasets

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy (%) | 51.86 | 41.51 | 40.89
Top-5 Accuracy (%) | 82.86 | 73.74 | 72.43

CLIP's performance on the three datasets
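
For context, here is a minimal sketch of how such zero-shot accuracies can be computed with the open_clip library. The checkpoint tag, prompt template, and species list below are illustrative assumptions, not necessarily the exact setup behind the numbers above:

```python
import torch
import open_clip
from PIL import Image

# Load an Open-CLIP model trained on LAION-400M; "laion400m_e32" is one
# of the published pretrained checkpoint tags for ViT-B/32.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# The 200 CUB species names; the prompt template below is an assumption.
class_names = ["Black footed Albatross", "Laysan Albatross"]  # ... 200 total
text = tokenizer([f"a photo of a {c}, a type of bird." for c in class_names])

with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(image_path: str) -> int:
    """Return the index of the best-matching species for one image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).argmax(dim=-1).item()
```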

Finding the Subsets

The dataset consists of 413 million text-image pairs, so tracing roughly 12,000 images feels like searching for needles in the ocean. To speed up the process, I chose to filter with specific keywords. Luckily, the dataset's metadata provides a sample ID, URL, and text description for each image, which allows me to query images by caption or domain.
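
As a sketch, this filtering can be done with pandas over the released metadata parquet shards. The directory and column names (SAMPLE_ID, URL, TEXT) follow the published LAION-400M metadata layout, but treat them as assumptions:

```python
import glob
import pandas as pd

# Scan the LAION-400M metadata parquet shards and keep rows whose URL
# or caption contains a keyword.
def query_metadata(parquet_dir: str, keyword: str) -> pd.DataFrame:
    matches = []
    for path in sorted(glob.glob(f"{parquet_dir}/*.parquet")):
        df = pd.read_parquet(path, columns=["SAMPLE_ID", "URL", "TEXT"])
        mask = (df["URL"].str.contains(keyword, case=False, na=False)
                | df["TEXT"].str.contains(keyword, case=False, na=False))
        matches.append(df[mask])
    return pd.concat(matches, ignore_index=True)

# Example: the domain query described in the next section.
inat = query_metadata("laion400m-meta", "www.inaturalist.org")
print(len(inat))
```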

The iNaturalist Subset

The CUB-2019 and CUB-2023 datasets are derived from inaturalist.org. However, a query for "www.inaturalist.org" returns only 540 results, most of which are non-bird images (insects, fish, plants, etc.). Therefore, CUB-2019 and CUB-2023 are clearly not part of this training set.

The CUB Classes Subset

The CUB dataset does not provide image URLs, and popular image-source-finding tools (TinEye and Google reverse image search) did not give satisfactory results in locating the image sources. Therefore, I changed my strategy: instead of tracing individual images, I created subsets of all possible images corresponding to the 200 bird species listed in the CUB dataset.

I then ran 200 queries, one for each of the 200 class names in the CUB dataset, as sketched below. All matching results are stored in an Excel file, with one sheet per species.
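
A sketch of this step, reusing the query_metadata helper above; the output file name is an assumption, and the sheet names are truncated to Excel's 31-character limit:

```python
# Run one caption query per CUB species and store each result on its
# own sheet of a single Excel workbook. (In practice, scanning the
# shards once and matching all 200 names per pass is much faster.)
cub_species = ["Black footed Albatross", "Laysan Albatross"]  # ... 200 names

with pd.ExcelWriter("laion_cub_species.xlsx") as writer:
    for name in cub_species:
        hits = query_metadata("laion400m-meta", name)
        hits.to_excel(writer, sheet_name=name[:31], index=False)
```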

Although the paired text mentions a bird species, some of the images show non-bird objects (phone cases, tote bags, hoodies, or postcards). I performed a preliminary filtering pass to remove these images.

Image Hash

Perceptual hashing is robust to minor modifications of an image, such as cropping or slight changes in brightness and contrast: visually similar images receive similar hash values. This makes it very effective for identifying duplicate or near-duplicate images.

Hashing images from the CUB dataset is straightforward since all the images are available. I simply computed the hash value for each image and saved the results in an Excel file. This file also contains 200 sheets, one per species, making it easy to compare images of the same species.
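
A minimal sketch with the imagehash library; the CUB directory layout (one folder per species) matches the standard CUB-200-2011 release, while the output file name is an assumption:

```python
import pathlib
import imagehash
import pandas as pd
from PIL import Image

# Compute a 64-bit perceptual hash (pHash) for every CUB image and
# write one sheet per species, mirroring the LAION spreadsheet layout.
cub_root = pathlib.Path("CUB_200_2011/images")  # one subfolder per species

with pd.ExcelWriter("cub_hashes.xlsx") as writer:
    for species_dir in sorted(cub_root.iterdir()):
        rows = [
            {"file_name": img.name,
             "file_path": str(img),
             "hash": str(imagehash.phash(Image.open(img)))}
            for img in sorted(species_dir.glob("*.jpg"))
        ]
        pd.DataFrame(rows).to_excel(
            writer, sheet_name=species_dir.name[:31], index=False)
```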

The LAION-400M official website provides instructions for fetching the images with img2dataset, so I could download the images in batches and then compute their perceptual hashes. The hash values are saved back to the spreadsheet.
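
A sketch of the download step using img2dataset's Python entry point; it assumes the filtered rows (all species sheets combined) were first exported to a single parquet file with URL and TEXT columns, and the parameter values here are illustrative:

```python
from img2dataset import download

# Download the filtered LAION-400M subset as plain image files,
# which are then easy to open and hash.
download(
    url_list="laion_cub_subset.parquet",  # assumed export of the sheets
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="files",
    output_folder="laion_cub_images",
    image_size=256,
    processes_count=4,
    thread_count=16,
)
```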

Image Matching

Both the CUB and LAION-400M spreadsheets are structured as follows:

File Name / File ID | File Path | Hash Value

I can measure the similarity of any two images by computing the Hamming distance between their hashes (the number of bit positions in which the two hash values differ).

Since each species has its own sheet, I compare hash values only for images within the same class. This avoids spurious matches between images of different species, as sketched below.
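
A sketch of the matching step; with imagehash, subtracting two hashes yields their Hamming distance. It assumes the two workbooks use matching sheet names per species, and that a hash column was added to the LAION workbook after downloading, as described above:

```python
import imagehash
import pandas as pd

def match_sheets(cub_xlsx: str, laion_xlsx: str, threshold: int = 10):
    """Yield (species, cub_path, laion_url, distance) for close pairs."""
    cub_sheets = pd.read_excel(cub_xlsx, sheet_name=None)    # dict of sheets
    laion_sheets = pd.read_excel(laion_xlsx, sheet_name=None)
    for species, cub_df in cub_sheets.items():
        laion_df = laion_sheets.get(species)
        if laion_df is None:
            continue
        # Brute-force within-species comparison; fine at this scale.
        for _, c in cub_df.iterrows():
            ch = imagehash.hex_to_hash(c["hash"])
            for _, l in laion_df.iterrows():
                d = ch - imagehash.hex_to_hash(l["hash"])  # Hamming distance
                if d <= threshold:
                    yield species, c["file_path"], l["URL"], d

pairs = list(match_sheets("cub_hashes.xlsx", "laion_cub_species.xlsx"))
```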

Thresholding and Results

Setting the threshold is challenging. With a threshold of 10, I identified 166 pairs; with a threshold of 12, I found 541 pairs; and 1,234 image pairs have a Hamming distance below 15. Even at low distances, some paired images look noticeably different or show clearly different content.
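
A small follow-up sketch for exploring cutoffs, reusing the match_sheets generator above (the file names remain the assumed ones):

```python
# Collect distances once at a loose threshold, then count matched
# pairs under several candidate cutoffs.
distances = [d for *_, d in match_sheets(
    "cub_hashes.xlsx", "laion_cub_species.xlsx", threshold=15)]
for cutoff in (10, 12, 15):
    print(cutoff, sum(1 for d in distances if d <= cutoff))
```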

In the following examples, the columns show the pair index and Hamming distance, the CUB image, and the LAION-400M image.

Concerns

The Text Captions

One concern is that images from the CUB dataset might exist in LAION-400M with different captions, such as "photo of a yellow bird". Querying only with species names in my initial step might therefore miss a significant number of images.

To address this, I reviewed the CUB-200 paper. It mentions using species names as query terms on Flickr and employing Amazon Mechanical Turk workers for image filtering and annotation; the process did not involve expert bird identification. Therefore, I believe querying with the same species names is reasonable.

Unavailable URLs

While using img2dataset to download the relevant images from LAION-400M, some downloads failed: LAION-400M stores images as URLs, and some of those URLs are no longer available. This means some images in the training set are no longer accessible for tracing.

Sample report from the img2dataset program, showing the count of failed downloads and the error log.

The Near-Duplicates

While examining the results, I observed images that look very similar but are not identical, for example, photos of the same species taken from the same angle.

Although they are not identical, their similarity raises a question: could such photos be indicative of overfitting in CLIP? The answer is not yet known, but I plan to explore this in future research.
