
Problem Statement

Previously, we made two new datasets, CUB-2019 and CUB-2023. However, the zero-shot accuracies of CLIP on these datasets did not match its performance on the original CUB. One possible reason is that CLIP may have overfitted to the original CUB dataset, so that the zero-shot performance, or "Validation Accuracy," on our CUB-2019 and CUB-2023 is lower than the "Training Accuracy." To validate this hypothesis, I decided to examine CLIP's training set in more detail.

Approach

The LAION-400M Dataset

CLIP's training dataset has not been publicly released, but it is known to be derived from publicly available sources across the Internet. Open-CLIP, an open-source implementation of CLIP, is trained on datasets such as LAION-400M and LAION-2B, both of which originate from Common Crawl, the largest open web-crawl data source. I selected the model trained on LAION-400M, since LAION-400M is similar in size to CLIP's training set (WebImageText) and shows similar performance to CLIP on the three CUB datasets:

dataset        | CUB   | CUB-2019 | CUB-2023
Top-1 Accuracy | 56.35 | 45.06    | 44.08
Top-5 Accuracy | 86.31 | 76.82    | 75.19
Open-CLIP's performance on the three datasets

dataset        | CUB   | CUB-2019 | CUB-2023
Top-1 Accuracy | 51.86 | 41.51    | 40.89
Top-5 Accuracy | 82.86 | 73.74    | 72.43
CLIP's performance on the three datasets

Finding the Subsets

The dataset consists of 413 million text-image pairs, so tracing 12,000 images feels like searching for needles in the ocean. To speed up the process, I chose to filter with specific keywords. Luckily, the dataset's metadata provides a sample ID, URL, and text description for each image, which allows me to query images by caption or domain.
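This querying step can be done entirely offline on the metadata. Below is a minimal sketch, assuming the metadata has been downloaded as parquet files into a local folder (here called laion400m-meta/) and uses column names SAMPLE_ID, URL, and TEXT; the folder name, column names, and example keyword are my assumptions and should be adjusted to the actual schema.

import glob
import pandas as pd

keyword = "Laysan Albatross"       # caption keyword to query
domain = "www.inaturalist.org"     # domain to query in the URL field

matches = []
for path in glob.glob("laion400m-meta/*.parquet"):            # local metadata files
    df = pd.read_parquet(path, columns=["SAMPLE_ID", "URL", "TEXT"])
    hit = df[
        df["TEXT"].str.contains(keyword, case=False, na=False)
        | df["URL"].str.contains(domain, case=False, na=False)
    ]
    matches.append(hit)

result = pd.concat(matches, ignore_index=True)
print(f"{len(result)} rows matched")
result.to_excel("laion_matches.xlsx", index=False)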

The iNaturalist Subset

The CUB-2019 and CUB-2023 datasets are derived from inaturalist.org. However, a query on "www.inaturalist.org" returns only 540 results, most of which are non-bird data (insects, fish, plants, etc.). Therefore, CUB-2019 and CUB-2023 are clearly not part of CLIP's training set.

The CUB Classes Subset

The CUB dataset does not provide image URLs, and popular reverse image search tools (TinEye and Google reverse image search) did not give satisfactory results in finding the image sources. Therefore, I changed my strategy and created subsets of all possible images corresponding to the 200 bird species listed in the CUB dataset.

I then made 200 queries, one for each of the 200 classes in the CUB dataset. All matching results are stored in an Excel file, with one sheet per species.

Some images, although they have bird species in the pairing text, show non-bird objects (phone cases, tote bags, hoodies, or postcards). I performed a preliminary filtering pass to remove these images.

Image Hash

Perceptual hashing is robust to minor modifications in images, such as cropping and slight changes in brightness or contrast, which makes it very effective for identifying duplicate or near-duplicate images.

Hashing images from the CUB dataset is straightforward since all the images are available locally. I simply compute the hash value for each image and save the information in an Excel file. This file also contains 200 sheets, one per species, making it easier to compare images of the same species.
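Here is a minimal sketch of this step, assuming the standard CUB_200_2011/images/<species>/ folder layout and using the imagehash library's pHash; the folder layout, hash choice, and output file name are my assumptions, not the exact setup used here.

import glob
import os
import pandas as pd
import imagehash
from PIL import Image

root = "CUB_200_2011/images"                       # assumed CUB folder layout
with pd.ExcelWriter("cub_hashes.xlsx") as writer:
    for species in sorted(os.listdir(root)):
        rows = []
        for path in glob.glob(os.path.join(root, species, "*.jpg")):
            h = imagehash.phash(Image.open(path))  # 64-bit perceptual hash
            rows.append({"file_name": os.path.basename(path),
                         "file_path": path,
                         "hash": str(h)})
        # Excel sheet names are limited to 31 characters
        pd.DataFrame(rows).to_excel(writer, sheet_name=species[:31], index=False)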

The official LAION-400M website provides instructions for accessing the images using img2dataset, so I could download the images in batches and then compute their perceptual hashes. The hash values are saved back to the spreadsheet.

Image Matching

Both the spreadsheets of CUB and LAION-400M are structured as follows:

File name/ File ID | File Path | Hash Value

I can measure the similarity of any two images by computing the Hamming distance between their hash values (the number of bit positions in which the two hashes differ).

Since each species has its own sheet, I compare hash values only for images within the same class. This avoids matching images from different species.
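A minimal sketch of the per-species matching, assuming both spreadsheets follow the layout above; the file names, sheet name, column names, and example threshold are placeholders of my own, not the exact values used here.

import pandas as pd
import imagehash

species_sheet = "Laysan Albatross"                 # placeholder sheet name
threshold = 10                                     # example Hamming-distance cutoff

cub = pd.read_excel("cub_hashes.xlsx", sheet_name=species_sheet)
laion = pd.read_excel("laion_hashes.xlsx", sheet_name=species_sheet)

pairs = []
for _, c in cub.iterrows():
    c_hash = imagehash.hex_to_hash(c["hash"])
    for _, l in laion.iterrows():
        dist = c_hash - imagehash.hex_to_hash(l["hash"])   # Hamming distance
        if dist <= threshold:
            pairs.append((c["file_name"], l["file_name"], dist))

print(f"{len(pairs)} candidate duplicate pairs for {species_sheet}")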

Thresholding and Results

Setting the threshold is challenging. With a threshold of 10, we identified 166 pairs; with a threshold of 12, we found 541 pairs. There are 1,234 image pairs with a Hamming distance of less than 15. Even with low distances, some images appear relatively different or have significantly different content.

In the following examples, the columns show the index, the hash difference, the CUB image, and the LAION-400M image.

Concerns

The Text Captions

Some are concerned that images in the CUB dataset might also exist in LAION-400M but with different captions, such as "photo of a yellow bird." Querying only with species names in my initial step might therefore miss a significant number of images.

To address this, I reviewed the CUB-200 paper. The article mentions using species names as query terms on Flickr and employing Amazon Mechanical Turk for image filtering and annotation. The process did not involve expert bird identification. Therefore, I believe using the same species names for querying is reasonable.

Unavailable URLs

While using img2dataset to download relevant images from LAION-400M, some downloads failed because LAION-400M stores images as URLs and some URLs are no longer available. This means there are images in the training set that are no longer accessible for tracing.

Sample report from the img2dataset program, showing the count of images that failed to download and the error log.

The Near-Duplicates

While examining the results, I observed images that look very similar but are not identical, for example, photos of the same species taken from the same angle.

Although they are not identical, their similarity raises a question: Could such photos be indicative of overtraining by CLIP? The answer is not yet known, but I plan to explore this in my future research.

Since we've been able to reproduce the results of "Do LLMs Know About Hallucination?", we have begun to ask whether the authors' technique can be applied to more realistic content. The datasets in this paper are composed almost entirely of questions whose answers are single-sentence or single-word factoids. In reality, however, LLMs can be quite wordy. The question now is: can we find the needles (confabulations) in this haystack (paragraph)?

To test this, I took inspiration from their lists of the top 10 tokens associated with the directions of correctness and hallucination (Table 1) and attempted to apply the idea to our new task. Specifically, I tracked how close the transition vectors were to the word "True" (based on the model's token classification layer) over the course of the generation of the answer.
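The code below is a minimal sketch of my reading of this measurement, not the exact procedure from the paper: it treats the change in the final-layer hidden state from one token to the next as the "transition vector" and scores it against the unembedding row for the token "True". The model name, the placeholder prompt and answer text, and the cosine-similarity scoring are all assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"            # assumed model; any HF causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

prompt = "I want you to act as an attending physician and give feedback on the SOAP note below. ..."
answer = " ... I think you prescribe Benadryl ..."
ids = tok(prompt + answer, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)
hidden = out.hidden_states[-1][0].float()          # (seq_len, hidden_dim), final layer

true_id = tok.encode("True", add_special_tokens=False)[0]
true_vec = model.lm_head.weight[true_id].float()   # unembedding row for "True"

# "Transition vector": change in the hidden state from one token to the next.
transitions = hidden[1:] - hidden[:-1]
scores = torch.nn.functional.cosine_similarity(transitions, true_vec.unsqueeze(0), dim=-1)

for token, score in zip(tok.convert_ids_to_tokens(ids[0].tolist())[1:], scores.tolist()):
    print(f"{token:>15s}  {score:+.3f}")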

The dataset used is a subset of 20 SOAP notes from our EHA dataset, along with each note's AI-generated response. So our "question" for each datapoint is, "I want you to act as an attending physician and give feedback on the SOAP note below," followed by the note. The "answers" are each note's AI-generated feedback. However, for each note, I manually changed the feedback in the medication section to say, "I think you prescribe Benadryl." Benadryl was not originally prescribed in any of the unedited feedback, so in each case Benadryl is an inappropriate medication to prescribe.

Above is a graph showing the "swings" in the classification values for "True" once the word "Benadryl" has been generated (in red) and for the word right before it (in blue). Each blue token is a word that is not in any way actually wrong, followed by a red dot representing a medication that is clearly inappropriate for that patient. Each blue-red pair of tokens is from the same response. This visualization makes a compelling case for there being a clear "swing" away from "True" in the model's embedding space when a wrong word has been generated.

However, when looking at a longer slice of each response, it becomes clear that the perceived downward swing in the first chart is not actually real. Especially looking at the dark blue line representing the average, the expected "drop-off" at the end of the graph does not occur. This means that sheer similarity to "True" is not enough for us to detect a single-word confabulation. However, this is just one possible approach, and going forward I will be investigating more.


TL;DR: We tested our previous text-only consistency score on two more datasets: Stanford Dogs and Flower102. We compared it to Xiaotong's vision-text score, which is computed based on generated images.

It worked very well and robustly (hopefully, there are no bugs in my code)!

Work in Progress: Visualization

Recall that our text-only score intuitively evaluates the degree to which the CLIP model understands a concept and how well it separates the target class from neighboring classes.

I want to create some visualizations to see what the classes with the lowest text consistency scores can tell us:

In the following figures, I plot the images of the class with the lowest text consistency score and the class they are most likely to be predicted as, in a left vs. right manner. Additionally, I indicate their accuracy and text consistency score above each pair.

Ideally, we want to determine why a class is misclassified: is it because the CLIP model doesn't understand the concept of the class, or because it can't distinguish the class from its neighboring classes? Or maybe both?

Example of Lack of Knowledge

Let's start with a patch of the plot for the Flower102 dataset (the complete high-resolution plot is available at: https://github.com/ywugwu/CSCI-6212/blob/main/_posts/imgs2/Flower102_lowest_type_1_plus_2_plus_3_text_consistency_score.png?raw=true):

We plot the images of the class with the lowest text consistency score against the images of the class that CLIP most often predicts them as:

The two flowers do not look similar, yet CLIP most often predicts the left red flowers as the right purple flowers. Intuitively, CLIP understands the right purple flower quite well, as it has a high text consistency score and accuracy. Since the red flower does not even look similar to the purple flower, it seems the left red flower is misclassified because CLIP can't understand what it is.

Example of Appearance Confusion (Need More Scores to Support)

Next, let's look at an example from the Stanford Dogs dataset: these two dogs are, in my opinion, misclassified because they are so similar in appearance. We can also see that their text consistency scores and accuracies are similar.

Our consistency score is composed by summing different text consistency scores: at a minimum, we have a semantic understanding score and a class confusion score. I think I should also report the separate text consistency scores, as they could explain a lot.

The first one.

But wait! There's more! We may be able to learn something about the way LLMs work based on 2nd place to last place.

I'll back up and say a little bit more about what exactly I mean when I ask, "Which sentence receives the most attention?" In the paper we presented at ICHI, we remarked on how in some cases it seemed like GPT was "paying lip service" to certain aspects of our prompt, not actually changing its answers to match concerns we would raise about cost, time, etc.

We'd insert into our prompt something to the effect of, "These tests can be very expensive, so do not recommend ones that are not relevant to the patient's acute condition." Then GPT would respond with the same test recommendations it gave for our "base" prompt, but it would throw in a sentence about being careful not to recommend too many tests. This suggests that, for the sentence with the actual recommendations, it was paying a lot of attention to the list of symptoms, and for the sentence after it (where it would warn against over-recommending tests), it was paying a lot of attention to the warning in our prompt.

This raises the question of whether or not we can actually observe such a pattern at the sentence level. What would an attention map, like the one below, look like grouped into sentences?

To investigate this question, we unfortunately have to look at a model that isn't in the GPT-4 family, since access to those models' attention values is not available. Additionally, I wanted to build a proof of concept that I could run on my own computer, so I picked the 7-billion-parameter version of LLaMA 2 available on Hugging Face.
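Since the model and tokenizer are on Hugging Face, the attention weights can be pulled out directly. Here is a minimal sketch; the averaging over all layers and heads is my own choice of reduction, and the checkpoint name and placeholder text are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="eager",   # ensures attention weights are actually returned
)
model.eval()

# Full input-output text (the prompt plus the model's response, as quoted below).
text = "Tell me a fun fact about the Holy Roman Empire. ..."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # average over layers and heads
print(attn.shape)                                        # (seq, seq) token-level matrix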

An attention matrix for the input-output pair (defined below) as processed by LLaMA 2. Here, the x and y axes are the tokens of the input-output pair, running from the first token (at what would be the (0, 0) point on most other graphs) to the last token. Here, the color scale is 0-1. This will change on other figures.

Above we can see the attention matrix for the following input and output to LLaMA 2:
Input: Tell me a fun fact about the Holy Roman Empire. Then, tell me a fun fact about nuclear fusion. Then, tell me a fun fact about the TV show Friends. Please tell me one fact for each.
Output: Fun fact about the Holy Roman Empire: The Holy Roman Empire was the first multinational state in European history. It was a political entity that existed from 962 to 1806 and included territories in modern-day Germany, Austria, Switzerland, Belgium, the Netherlands, and parts of France. Fun fact about nuclear fusion: Nuclear fusion is the process by which atomic nuclei combine to form a heavier nucleus, releasing energy in the process. It is the same process that powers the sun and other stars. Scientists have been working on developing a practical fusion reactor for decades, but it remains an elusive goal. Fun fact about the TV show Friends: The iconic Central Perk coffee shop from the show was actually filmed in a real coffee shop in Los Angeles called "The Coffee Bean & Tea Leaf." The show's creators chose this location because they wanted the coffee shop to have a cozy, inviting atmosphere that would feel like a second home to the characters.

This weird prompt was chosen as a "toy" example so that we would have a response conducive to seeing a shifting focus, from fact #1 to fact #3. However, at least here at the token level, the main thing we see instead is that a lot of attention is being paid to the first token in the prompt: the start token, "<s>." In fact, this level of attention paid to "<s>" is so great that, when we combine the tokens into sentences and look at how the attention is distributed over these sentences, that is the primary feature we see.
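A minimal sketch of how the token-level matrix can be collapsed into sentence-level attention; the sentence segmentation, the handling of special tokens, and the per-pair averaging are my assumptions about the exact bookkeeping used for the figures here.

import torch

def sentence_attention(attn, token_to_sentence, n_sentences):
    """Collapse a (seq, seq) token-attention matrix into sentence-level attention.
    token_to_sentence maps each token index to its sentence index; use -1 for
    special tokens such as <s> to exclude them."""
    sent_attn = torch.zeros(n_sentences, n_sentences)
    counts = torch.zeros(n_sentences, n_sentences)
    for i, si in enumerate(token_to_sentence):
        for j, sj in enumerate(token_to_sentence):
            if si < 0 or sj < 0:
                continue                     # skip special tokens
            sent_attn[si, sj] += attn[i, j]
            counts[si, sj] += 1
    return sent_attn / counts.clamp(min=1)   # mean attention per token pair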

A sentence-level attention matrix. Here, the color scale is 0-0.09, a significant decrease from the previous figure.

Interestingly, there is a spike in attention on the sentence, "Fun fact ... history." However, this is the first sentence of the response, and even though special tokens are not shown on the axes of this graph, it also starts with the special token "<s>." Since that really skews the levels of attention, let's look at these sentences without the special tokens.

A "cleaner" version of the sentence-level attention matrix. Here, the color scale is 0-0.025. So, any changes represented here are very small.

A greater level of attention is still being given to the first sentence of the output, even when not counting the special tokens. Outside of this observation, it's hard to tell to what degree the model places "extra" attention on what humans would perceive to be the "most relevant" sentence of the prompt. I had hoped for much more dramatic results suggesting that the model pays a whole bunch of attention to the most relevant sentence and almost none to the others. Perhaps a larger experiment would show a slight but statistically significant difference that would be more definitive. But I don't think I'm going to go for it, mainly because I anticipate that, if the experiment did show that LLMs pay slightly more attention to more relevant sentences, I'd be left with a lingering, "So what?"


Abstract

In this blog post, we explore the use of text embeddings to simulate image embeddings and investigate the effectiveness of combining multiple consistency scores to enhance CLIP's zero-shot accuracy.

We find that:

  1. Text embeddings can be used to approximate image embeddings.
  2. Combining various consistency scores by summing them results in a stronger consistency score that correlates well with CLIP's zero-shot accuracy.

Introduction

Xiaotong has developed two consistency scores that strongly correlate with CLIP's zero-shot accuracy. This study aims to determine whether these scores can be approximated by replacing image embeddings with descriptive text embeddings and by combining multiple consistency scores.

Here, I_i represents the image embedding of image i, and T_i represents the text embedding of text i. We hypothesize that an image seen by CLIP can be described equivalently by text.

For example, the figure can be depicted as: "a small bird perched on a branch. The bird has a yellowish breast and throat, with a mix of gray and olive-green plumage on its back and wings..."

Intuitively, the descriptive text embedding can replace the image embedding in Xiaotong's equation, and the resulting text-only consistency score may still be strongly correlated with CLIP's zero-shot accuracy.

Methodology

We propose two types of text consistency scores corresponding to Xiaotong's equations, in which the descriptive texts are generated by ChatGPT:

bird_appearances = {
    "Black footed Albatross": "large seabird, North Pacific Ocean, wingspan 6.5 to 7.5 feet, dark plumage, black or dark brown, pale or white areas around beak, pale or white areas under eyes, black beak, black feet",
    "Laysan Albatross": "large seabird, wingspan up to 7 feet, white head and body, dark upper wings and back, dark eye patch, pink beak, pink feet",
    "Sooty Albatross": "large seabird, sooty brown to black plumage, long slender wings, white crescent above and behind the eye, dark bill, dark feet",
...}

Then, we formulate type1 & type2 consistency scores as below:

We use the type1 text consistency score to correspond to Xiaotong's first equation and the type2 text consistency score to correspond to Xiaotong's second equation.

Results

We compare Spearman's correlation coefficients of the text-only consistency scores with CLIP's zero-shot accuracy:

The text-only consistency scores show significant correlations but are lower than Xiaotong's image-text scores. This may be due to using class-wise descriptions instead of more fine-grained image-wise descriptions (class-wise descriptions are clearly suboptimal, since the equivalent text descriptions of individual images naturally differ even within the same species).

Future Work

We plan to improve type1 and type2 scores by using ChatGPT to generate 20,000 image-wise descriptions for the CUB dataset, which we expect will boost the Spearman correlation scores.

Additional Findings

From our previous week's results (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we found that CLIP's zero-shot accuracy correlates with how well it separates embeddings of different classes.

This week, we take one more step and say that CLIP's zero-shot accuracy correlates with how well it separates embeddings of different classes and how well it compacts embeddings of the same class.
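Purely as an illustration of this idea (reward within-class compactness and separation from the nearest other class), and not the actual type3 formula defined next, a score of this flavor might look like the sketch below; the function name, inputs, and the way the two terms are combined are all assumptions.

import numpy as np

def type3_like_score(class_embs, other_class_centers):
    """class_embs: (n, d) L2-normalized descriptive-text embeddings of one class.
    other_class_centers: (k, d) L2-normalized mean embeddings of the other classes."""
    center = class_embs.mean(axis=0)
    center = center / np.linalg.norm(center)
    compactness = float((class_embs @ center).mean())               # within-class cohesion
    separation = 1.0 - float((other_class_centers @ center).max())  # gap to nearest class
    return compactness + separation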

More formally, a type3 text consistency score is:

Results

While the type3 text consistency score performs similarly to the type2 score (0.51 vs. 0.50), their combination yields a much stronger consistency score of 0.59 (+0.08). What's next? Maybe a well-designed text-only method can approach vision-language methods.


Problem Statement

Over-represented datasets in model training data can lead to biases, causing models to perform well on these specific datasets but poorly on others. For instance, the CUB datasets might be over-represented in CLIP's training data.

My last week's post shared the results of CLIP zero-shot classification for CUB, CUB-2019, and CUB-2023: https://blogs.gwu.edu/pless/2024/06/10/comparative-study-of-clips-zero-shot-accuracy-on-cub-dataset/

This week, I want to identify the factors that make our accuracy about 10% lower than on CUB-200-2011. Is this discrepancy due to CUB being over-represented in CLIP's training data, or are there issues with our datasets?

Experiments

Exp1: Confusion Matrices

To evaluate CLIP's performance on CUB, CUB-2019, and CUB-2023, I plotted the following matrices to show the accuracy in predicting the correct classes. In the matrices, top-5 predictions are included (a sketch of how such a matrix can be built follows the observations below).

  • All three matrices show a clear diagonal line, indicating that CLIP correctly predicts the majority of images within its top-5 results.
  • Misclassifications (the non-zero values scattered off the diagonals) are dispersed throughout the matrices but cluster in certain areas. This suggests that CLIP tends to confuse certain classes, especially visually similar ones such as different types of sparrows and warblers. The clusters are common across all three matrices.
  • The color scale on the right indicates the frequency of predictions, with brighter colors representing higher frequencies. The brightness of the three matrices is in the order CUB > CUB-2019 > CUB-2023, though to me CUB and CUB-2019 exhibit similar brightness in this plot despite their numerical results.
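A minimal sketch of how such a top-5 confusion matrix can be built, based on my reading of "top-5 predictions are included"; the array names are placeholders.

import numpy as np

def top5_confusion(true_labels, top5_preds, n_classes=200):
    """true_labels: (N,) true class indices; top5_preds: (N, 5) top-5 predicted indices."""
    mat = np.zeros((n_classes, n_classes), dtype=int)
    for true_cls, preds in zip(true_labels, top5_preds):
        for pred_cls in preds:
            mat[true_cls, pred_cls] += 1   # each top-5 prediction counts once
    return mat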

Exp2: Minor Adjustments

My initial guess was that the original CUB, containing 5,994 training images and 5,794 testing images, is twice as large as our CUB-2019 or CUB-2023 datasets, which have only as many images as the testing set. To eliminate the effect of dataset size, I used only the testing set of CUB. It turns out the overall performance remains the same.

I came across the templates for zero-shot classification in CLIP's official Git repository and found that the prompt for BirdSnap is 'a photo of a {class_name}, a type of bird.' Thinking it would be helpful for bird-specific datasets, I adopted this prompt and got roughly a 1% improvement on each dataset (a sketch of this setup follows the table below).

dataset        | CUB   | CUB-2019 | CUB-2023
Top-1 Accuracy | 53.18 | 42.29    | 41.58
Top-5 Accuracy | 84.04 | 74.61    | 73.84
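A minimal sketch of zero-shot classification with this prompt, using the Hugging Face CLIP interface; the checkpoint name, image path, and the three example class names are placeholders rather than the exact configuration used here.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"       # assumed backbone
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)
model.eval()

class_names = ["Black footed Albatross", "Laysan Albatross", "Sooty Albatross"]  # in practice, all 200 classes
prompts = [f"a photo of a {name}, a type of bird." for name in class_names]

image = Image.open("example_bird.jpg")            # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image     # (1, num_classes)

k = min(5, len(class_names))
top_k = logits.topk(k, dim=-1).indices[0].tolist()
print([class_names[i] for i in top_k])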

Exp3: Show Stats and Examples of Images

To further investigate the results shown by the confusion matrices, I decided to examine the images themselves. I produced two sets of images:

  1. Bar charts displaying the difference in per-class accuracy between two datasets:
    1. CUB vs. CUB-2019;
    2. CUB vs. CUB-2023;
    3. CUB-2019 vs. CUB-2023;
  2. Large composite images containing all images per class for each dataset.

There are 8 classes in the original CUB dataset that are not present in our CUB-2019 and CUB-2023 datasets due to insufficient images. Therefore, we exclude the first 8 bars from each chart.

(The charts are extremely long and get blurry when uploaded, so I decided to share the images through Google Drive.)

I have not done a comprehensive analysis of all the charts, but simply looking at the corresponding composite images for the top categories with lower scores in CUB-2019 or CUB-2023 provides some insights.

Indigo Bunting in CUB-200-2011
Indigo Bunting in CUB-2019
Indigo Bunting in CUB-2023

Let's look at another example:

Purple Finch in CUB-200-2011
Purple Finch in CUB-2019
Purple Finch in CUB-2023

Both the Purple Finch and the Indigo Bunting are species with very distinct appearances between male and female birds. In the original CUB dataset, the majority of the images are of male birds, which have more prominent features that make them easier to classify. Our datasets, however, include more images of female and immature birds, resulting in nearly a 50% reduction in accuracy for this class compared to the original CUB.
Aside from the female and immature birds, our datasets also include photos taken from different angles, in dark environments, and with distracting backgrounds, all of which increase the difficulty of classification.

Exp4: Change Class Names (inspired by Abby)

Although CLIP's officially posted prompts do not include hyphens in bird names, I personally think class names like "Black-footed Albatross" are more common in the real world and would therefore yield better results than "Black footed Albatross." So I conducted experiments by adding the hyphens back to the class names.

text input | CUB w/ hyphen | CUB w/o hyphen
Top-1      | 53.49         | 53.19
Top-5      | 84.54         | 84.04

However, adding the hyphen results in only a slight improvement overall. A closer look at the per-class results reveals that some classes did see significant improvements. However, not all classes with a format similar to "Black-footed Albatross" improve as expected; some even experienced a decrease in accuracy. My current understanding of CLIP does not explain this result. If anyone has insights into this, please feel free to share in the comments.

Conclusion

At the end of the day, I cannot draw a definitive conclusion like "CUB is over-represented in CLIP's training data." However, I am confident that our datasets have their strengths: they feature more female birds, immature birds, and photos captured from various angles. Moving forward, I plan to conduct a comparative study between CUB-2019 and CUB-2023 and analyze the images in a different, more quantitative manner.


Continuing from last week's work (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we want to predict CLIP's zero-shot accuracy by looking only at its text embedding space.

First, let's briefly define the text prompts used in our study:

  • Plain Text: "a photo of a {class_name}"
  • Descriptive Text: "a photo of a {class_name} with {description}"

Ideally, the plain text and descriptive text together form a local space (manifold) corresponding to their class. And in that space, the plain text should serve as the "center."

Then, we want to know whether the manifolds of different classes vary because of CLIP's differing understanding of the concepts. Our first visualization suggests that they do:

The plot shows the visualization of CLIP's text embeddings on different classes:

  • Red 'x's are k samples with the lowest accuracy (bottom-k)
  • Green 'o's are k samples with the highest accuracy (top-k)
  • Black '+'s represent the plain text (the center)

All the green points look compact, while some red points, especially the top-left ones, are visually "skewed".

So the problem now becomes how to quantify the "manifold structure" and connect it to zero-shot accuracy.

The most straightforward method is to use variance to measure compactness. Unfortunately, we can hardly see any correlation between the variance and accuracy.

Zooming out from the previous visualization, we see the reason:

When considering the entire text space, the compactness or skewness of individual classes becomes negligible. Instead, the primary concern shifts to the intersection or confusion between different classes. This issue was highlighted in last week's work (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), where we observed a significant correlation between the distance of a class to its nearest neighboring class and its zero-shot accuracy.

So, what if we continue by using Xiaotong's idea:

Consistency Score = ("name 1" - "name 2") dot ("description 1" - "description 2")

Here, name 1 is the class we focus on, and name 2 is the nearest class to name 1. Description 1 and description 2 are detailed descriptions of the two classes generated by ChatGPT.
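A minimal sketch of this score using CLIP's text encoder; the checkpoint, the example class pair, the shortened descriptions, and the L2 normalization are my assumptions (in practice the descriptions come from ChatGPT as described above).

import torch
from transformers import CLIPModel, CLIPTokenizer

checkpoint = "openai/clip-vit-base-patch32"       # assumed backbone
model = CLIPModel.from_pretrained(checkpoint)
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
model.eval()

def embed(texts):
    tokens = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        emb = model.get_text_features(**tokens)
    return torch.nn.functional.normalize(emb, dim=-1)

name1 = "a photo of a California Gull"            # the class we focus on
name2 = "a photo of a Heermann Gull"              # its nearest class
desc1 = "a photo of a bird with a yellow bill with a black ring and red spot, gray back and wings, yellow legs"
desc2 = "a photo of a bird with a bright red bill with a black tip, dark gray body, black legs"

t_name1, t_name2, t_desc1, t_desc2 = embed([name1, name2, desc1, desc2])
consistency = torch.dot(t_name1 - t_name2, t_desc1 - t_desc2).item()
print(consistency)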

We get some positive correlation (computed as in the sketch after the list):

  • Pearson correlation: 0.30
  • Spearman correlation: 0.28
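The coefficients above come from comparing per-class scores with per-class zero-shot accuracy; a minimal sketch, with placeholder random arrays standing in for the real 200 per-class values:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder arrays; in practice these hold the 200 per-class values.
rng = np.random.default_rng(0)
scores = rng.random(200)        # consistency score per class
accuracies = rng.random(200)    # CLIP zero-shot accuracy per class

r, p = pearsonr(scores, accuracies)
rho, p_s = spearmanr(scores, accuracies)
print(f"Pearson: {r:.2f} (p={p:.2g})  Spearman: {rho:.2f} (p={p_s:.2g})")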

Conclusion

We have investigated the properties of CLIP's text embedding space that affect its zero-shot capacity. Our findings are as follows:

  1. The internal structure ("self-structure") of individual classes has minimal impact.
  2. The relationship between the nearest classes is crucial, aligning with principles from classical machine learning.

The idea ("name 1" - "name 2") dot ("description 1" - "description 2") is promising and I want to optimize some details to make the correlation more significant

Appendix: What I Tried but Failed

I'm interested in finding the relationship between self-structure and zero-shot capacity, so I tried a lot of experiments.

The methods for computing "self-structure" that failed include, but are not limited to:

  • ("a {bird_name} at daytime" - "a {bird_name} at nighttime") dot ("a bird at daytime" - "a bird at nighttime")
  • max_{i!=j} (corpus[i][bird_name] - corpus[j][bird_name]) dot max_{i!=j} (corpus[i][bird] - corpus[j][bird])

where the corpus ≈ [
    f"a photo of a {bird_name} flying over the ocean",
    f"a photo of a {bird_name} perched on a tree branch",
    f"a photo of a colorful {bird_name} in a rainforest",
    f"a photo of a majestic {bird_name} soaring high in the sky",
    f"a photo of a flock of {bird_name}s migrating at sunset",
    f"a photo of a {bird_name} hovering near a flower",
    f"a photo of a {bird_name} waddling on the ice",
    f"a photo of an {bird_name} peeking out from a tree hole",
    f"a photo of a {bird_name} standing in a shallow pond",
    f"a photo of a {bird_name} tapping on a tree trunk",
    f"a photo of a group of {bird_name}s by a lake",
    f"a photo of a {bird_name} feeding its chicks in the nest",
    f"a photo of a {bird_name} fishing in a river",
    f"a photo of a {bird_name} diving into water",
    f"a photo of a {bird_name} with vibrant feathers preening itself",
    f"a photo of a {bird_name} singing at dusk"]


Me presenting at ICHI All4Health workshop. Picture credit: Robert Pless

Earlier this week, I attended the ALL4Health workshop at ICHI, in which I presented some early results from our ongoing pilot study with EHA Clinics. Here, I want to share some thoughts I had while listening to the day's talks.

The day reaffirmed for me how impactful the integration of LLMs in LMICs can be. An obvious potential critique of the practice is: if people are really interested in making quality care more accessible, why are they not funding efforts to train people who live there to be doctors, as opposed to implementing some weird, convoluted AI system? There are several reasons that the workshop made apparent to me. Primarily, there are already many active healthcare institutions in these regions, as the existence of EHA Clinics and other organizations like it (many of which I learned about at the workshop) proves; upending these institutions and replacing them with something completely dependent on outside support and intervention is not ideal from a developmental standpoint. Additionally, an approach purely focused on skills development may ultimately undermine its own goals, since a skilled physician may end up leaving their home country for an opportunity to work in a wealthier one. In perhaps one of the most striking remarks of the day, one of the speakers referenced a quip that an Ethiopian colleague of theirs had made: "There may just be more Ethiopian ophthalmologists in Chicago than in Ethiopia."

It should also be noted that this movement toward LLM integration in clinical systems is happening in rich countries as well. Keynote speaker Mark Dredze spoke very openly about how, just a few years ago (the days of GPT-2), he firmly believed that people would not be able to build a medical chatbot fit for inclusion in a real clinical setting. He pointed to how the ability to answer medical licensing questions accurately does not make one a good doctor, as well as to the facts that clinical EHR data are often ambiguous and that the truly hard cases for doctors are those that don't match textbook presentation. However, he humorously admitted that he was dead wrong: LLMs are currently far more capable with medical data than he thought at the time they could be. Today, major tech and medical companies are partnering to realize in-clinic chatbot assistants. Now, the questions to be addressed are not whether they can be useful, but how exactly they should be used and how their performance should be evaluated. Dr. Dredze personally advocated for LLMs being used to identify and recommend which specialist(s) a patient should see in their particular case, due to the incredible breadth of knowledge that LLMs have. This knowledge is useful not just for rich patients, and given how inexpensive a single query to, for example, GPT-4 is, it can and should be used in as many safe contexts as is feasible.

Like the paper I presented, all of the work I got to see focused on the question of how to make LLM integration safe and feasible. In most cases, the work concentrated on a single case study: a specific clinic in a defined region of sub-Saharan Africa, sometimes focused on a particular specialty, such as maternal care. A key objective for many of the studies was to identify and describe integration strategies that the human staff at the clinic liked and trusted. In line with this goal, many of the speakers presented rankings of model-prompt pairs by both algorithmic means (such as BLEU) and human feedback surveys. Perhaps one of the most interesting takeaways from the workshop, according to my personal notes, is that out of the four talks that ranked models and prompts using both algorithmic evaluations and human feedback, each reported a different "best" model for the two rankings; the models that performed best on benchmark evaluations did not perform best in the feedback surveys. Although this is a small sample, it suggests that there still does not exist an algorithmic metric that adequately approximates how likely a human is to have a positive perception of the language model's output. At the same time, it is also unclear what exactly makes a "good" response. Is it politeness? Is it how clearly the information is organized and laid out? These questions are particularly relevant to our ongoing project with eHealth Africa, as we continue to refine our prompt and understand what GPT does well and poorly.

Based on a conversation I was able to have with some of the organizers and attendees, there was a strong belief among those at the workshop that there would be another iteration of it next year, and that it would be bigger, as LLMs continue to be more widely adopted throughout the healthcare industry. Based on what I saw at this year's event, I would say that there are plenty of interesting questions related to this subject that are unanswered, so another workshop next year (and for several more years after that) would certainly be valuable.

The primary goal of this study is to establish baseline CLIP zero-shot results for a newly completed dataset.

The analysis includes the original CUB-200-2011 dataset, the CUB-2019 dataset, which may have been included in CLIP's training data, and the CUB-2023 dataset, which consists of images that are not part of CLIP's training data.

Results

dataset        | CUB   | CUB-2019 | CUB-2023
Top-1 Accuracy | 51.86 | 41.51    | 40.89
Top-5 Accuracy | 82.86 | 73.74    | 72.43

Comparison with Published Findings

In "If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions," six fine-grained image classification datasets, including CUB-200-2011, were evaluated using CLIP to analyze how Vision-Language Models prioritize information. The zero-shot accuracy of CLIP on the CUB dataset was reported as 51.4% (Esfandiarpoor et al., 2024).

In the study "How Well Does CLIP Understand Texture?" the authors performed zero-shot learning on various texture and material classification datasets. The zero-shot top-k classification accuracy on the CUB dataset is presented in a table in the paper; notably, the top-1 accuracy using default class names (common names) was 51.8%. A significant finding of this paper is the substantial performance drop observed when common names were replaced with scientific, genus, family, and order names (Wu and Maji, 2022).

In the paper "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts," the authors reported a zero-shot transfer accuracy of 54.70% on the CUB dataset using the ViT-B/16 backbone model (Maniparambil et al., 2023). Using the model ViT-B/16 in my experiments also resulted in a 55% accuracy.

References

Esfandiarpoor, Reza, Cristina Menghini, and Stephen H. Bach. “If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions.” arXiv, March 25, 2024. https://doi.org/10.48550/arXiv.2403.16442.

Maniparambil, Mayug, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E. O’Connor. “Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts,” 262–71, 2023. https://openaccess.thecvf.com/content/ICCV2023W/MMFM/html/Maniparambil_Enhancing_CLIP_with_GPT-4_Harnessing_Visual_Descriptions_as_Prompts_ICCVW_2023_paper.html.

Wu, Chenyun, and Subhransu Maji. “How Well Does CLIP Understand Texture?” arXiv, November 4, 2022. http://arxiv.org/abs/2203.11449.


TL;DR: We want to predict CLIP's zero-shot ability by only seeing its text embedding space. We made two hypotheses:

  1. CLIP’s zero-shot ability is related to its understanding of ornithological domain knowledge, such that the text embedding of a simple prompt (e.g., "a photo of a Heermann Gull") aligns closely with a detailed descriptive prompt of the same bird. (This hypothesis was not supported by our findings)
  2. CLIP’s zero-shot ability is related to how well it separates one class's text embedding from the nearest text embedding of a different class. (This hypothesis showed moderate support)

Hypothesis 1:

Motivation

How would a bird expert tell the difference between a California gull and a Heermann's Gull?

A California Gull has a yellow bill with a black ring and red spot, gray back and wings with white underparts, and yellow legs, whereas a Heermann's Gull has a bright red bill with a black tip, dark gray body, and black legs.

Experts utilize domain knowledge/unique appearance characteristics to classify species.

Thus, we hypothesize that, if CLIP's multimodal training gives it the same domain knowledge that experts use, then the text embedding of "a photo of a Heermann Gull" (let's denote it as plain_prompt(Heermann Gull)) should be close to the text embedding of "a photo of a bird with Gray body and wings, white head during breeding season plumage, Bright red with black tip bill, Black legs, Medium size. Note that it has a Bright red bill with a black tip, gray body, wings, and white head during the breeding season." (let's denote it as descriptive_prompt(Heermann Gull)), and vice versa.
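A minimal sketch of this comparison for a single species, using the Hugging Face CLIP text encoder; the checkpoint choice is an assumption, not necessarily the one behind the numbers reported here.

import torch
from transformers import CLIPModel, CLIPTokenizer

checkpoint = "openai/clip-vit-base-patch32"       # assumed backbone
model = CLIPModel.from_pretrained(checkpoint)
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
model.eval()

plain = "a photo of a Heermann Gull"
descriptive = ("a photo of a bird with Gray body and wings, white head during breeding "
               "season plumage, Bright red with black tip bill, Black legs, Medium size.")

tokens = tokenizer([plain, descriptive], return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    emb = model.get_text_features(**tokens)
emb = torch.nn.functional.normalize(emb, dim=-1)
print("cosine similarity:", torch.dot(emb[0], emb[1]).item())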

For example, the cosine similarity between the two prompts of the Chuck-will's-widow is 0.44 (lowest value across the CUB dataset), and the zero-shot accuracy on this species is precisely 0.

Then, we can formulate our hypothesis as follows

(T_* denotes the text embedding of *):

We tested our hypothesis in the CUB dataset.

Qualitative and Quantitative Results

The cosine similarity between "a photo of Yellow breasted Chat" and "a photo of a bird with Olive green back, bright yellow breast plumage" is 0.82, the highest value across the whole CUB dataset. However, the zero-shot accuracy on this species is 10% (the average accuracy is 51%).

We got the Pearson correlation coefficient and the Spearman correlation coefficient between accuracy and the text embedding similarity as follows:

  • Pearson correlation coefficient = -0.14, p-value: 0.05
  • Spearman correlation coefficient = -0.14, p-value: 0.05

The coefficients suggest a very weak linear correlation.

We also made a line plot of accuracy vs. text embedding similarity, which shows no meaningful trend (though perhaps we can say the zero-shot accuracy tends to zero when the text embedding similarity is lower than 0.50):

Thus, we conclude that the hypothesis is not supported.

I think there are possibly two reasons:

  • The lack of correlation might be due to the nature of CLIP's training data, where captions are often not descriptive
  • CLIP does not utilize domain knowledge in the same way humans do

Hypothesis 2

Motivation

We examine the species with nearly zero CLIP accuracy:

On the left are the input images, and on the right are images of the species most often predicted for those inputs.

We can see that they are close in appearance. Therefore, we wonder if their text embeddings are close as well.

More formally, we want to examine the cosine similarity between one species' text embedding and its nearest text embedding to see if CLIP's inability to distinguish them at the semantic level possibly causes the classification to fail.

Qualitative and Quantitative Results

We also got the Pearson correlation coefficient and the Spearman correlation coefficient:

  • Pearson correlation coefficient = -0.43, p-value ≈ 1.4e-10
  • Spearman correlation coefficient = -0.43, p-value ≈ 1.3e-10

which suggests a significant but moderate negative correlation.

And a very noisy plot ......

Wait, what if we smooth the line plot by averaging every 20 points into one point:

The trend looks clearer, although there is still an "outlier."
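The smoothing here is simple binned averaging; below is a minimal sketch, where sorting the points by similarity before binning is my assumption about the setup.

import numpy as np

def bin_average(similarities, accuracies, bin_size=20):
    """Sort points by similarity, then average each block of bin_size points."""
    order = np.argsort(similarities)
    xs = np.asarray(similarities)[order]
    ys = np.asarray(accuracies)[order]
    n = (len(xs) // bin_size) * bin_size              # drop the ragged tail
    xb = xs[:n].reshape(-1, bin_size).mean(axis=1)
    yb = ys[:n].reshape(-1, bin_size).mean(axis=1)
    return xb, yb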

In conclusion, I think we can't determine CLIP's zero-shot behavior on a class without the information/context of the other classes. For example, CLIP completely fails to distinguish a California Gull from a Heermann's Gull, while it perfectly solves the problem of, e.g., a banana vs. a Heermann's Gull.

Next step, I want to investigate:

  1. Are there some special local geometry properties that are related to the zero-shot ability?
  2. When and why does CLIP's zero-shot prediction fail? Is it because the image encoder misses detailed high-resolution features, or because the text encoder fails to encode the most "unique" semantic information? Or maybe it is just because we need a larger model to align the image embedding space and the text embedding space.