TL;DR: We tested our previous text-only consistency score on two more datasets: Stanford Dogs and Flower102. We compared it to Xiaotong's vision-text score, which is computed based on generated images.

It worked very well and robustly (hopefully, there are no bugs in my code)!

Work in Progress: Visualization

Recall that our text-only score intuitively evaluates the degree to which the CLIP model understands a concept and how well it separates the target class from neighboring classes.

I want to create some visualizations to see what the classes with the lowest text consistency scores can tell us:

In the following figures, I plot the images of the class with the lowest text consistency score and the class they are most likely to be predicted as, in a left vs. right manner. Additionally, I indicate their accuracy and text consistency score above each pair.

Ideally, we want to determine why a class is misclassified: is it because the CLIP model doesn't understand the concept of the class, or because it can't distinguish the class from its neighboring classes? Or maybe both?

Example of Lack of Knowledge

Let's start with a crop of the plot for the Flower102 dataset (the complete high-resolution plot is available at https://github.com/ywugwu/CSCI-6212/blob/main/_posts/imgs2/Flower102_lowest_type_1_plus_2_plus_3_text_consistency_score.png?raw=true):

We plot images of the class with the lowest text consistency score alongside images of the class that CLIP most often predicts them as:

The two flowers do not look similar, yet CLIP most often predicts the red flowers on the left as the purple flowers on the right. Intuitively, CLIP understands the purple flower quite well, as it has both a high text consistency score and high accuracy. Since the two classes are not visually similar, the red flower seems to be misclassified because CLIP does not understand what it is.

Example of Appearance Confusion (Need More Scores to Support)

Next, let's look at an example from the Stanford Dogs dataset: these two dogs are, in my opinion, misclassified because they are so similar in appearance. We can also see that their text consistency scores and accuracies are similar.

Our consistency score is the sum of several component text consistency scores: at minimum, a semantic-understanding score and a class-confusion score. I should also report these component scores separately, as they could explain a lot.


Abstract

In this blog post, we explore the use of text embeddings to simulate image embeddings and investigate the effectiveness of combining multiple consistency scores to enhance CLIP's zero-shot accuracy.

We find that:

  1. Text embeddings can be used to approximate image embeddings.
  2. Combining various consistency scores by summing them results in a stronger consistency score that correlates well with CLIP's zero-shot accuracy.

Introduction

Xiaotong has developed two consistency scores that strongly correlate with CLIP's zero-shot accuracy. This study aims to determine whether these scores can be adapted by replacing image embeddings with descriptive text embeddings, and whether combining multiple consistency scores helps.

Here, I_i represents the image embedding of image i, and T_i represents the text embedding of text i. We hypothesize that an image seen by CLIP can be described equivalently by text.

For example, the figure can be depicted as: "a small bird perched on a branch. The bird has a yellowish breast and throat, with a mix of gray and olive-green plumage on its back and wings..."

Intuitively, descriptive text embeddings can replace the image embeddings in Xiaotong's equations. The new text-only consistency score may still correlate strongly with CLIP's zero-shot accuracy.

Methodology

We propose two types of text consistency scores corresponding to Xiaotong's equations, in which the descriptive texts are generated by ChatGPT:

bird_appearances = {
    "Black footed Albatross": "large seabird, North Pacific Ocean, wingspan 6.5 to 7.5 feet, dark plumage, black or dark brown, pale or white areas around beak, pale or white areas under eyes, black beak, black feet",
    "Laysan Albatross": "large seabird, wingspan up to 7 feet, white head and body, dark upper wings and back, dark eye patch, pink beak, pink feet",
    "Sooty Albatross": "large seabird, sooty brown to black plumage, long slender wings, white crescent above and behind the eye, dark bill, dark feet",
...}

Then, we formulate type1 & type2 consistency scores as below:

We use the type1 text consistency score to correspond to Xiaotong's first equation and the type2 text consistency score to correspond to Xiaotong's second equation.
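The exact equations are shown in the figures above and are not reproduced here. As a rough illustration, the following is a minimal sketch (not the exact implementation) of how such scores could be computed purely from CLIP text embeddings, assuming type1 is the cosine similarity between a class's plain prompt and its descriptive prompt, and type2 is the difference-dot-product between a class and its nearest class; the model name and the embed() helper are illustrative choices.

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    # L2-normalized CLIP text embeddings for a list of strings.
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**tokens)
    return torch.nn.functional.normalize(feats, dim=-1)

classes = list(bird_appearances.keys())
plain = embed([f"a photo of a {c}" for c in classes])
desc = embed([f"a photo of a {c} with {bird_appearances[c]}" for c in classes])

# type1 (sketch): does CLIP "know" the class? Plain prompt vs. descriptive prompt.
type1 = (plain * desc).sum(dim=-1)

# type2 (sketch): how well is the class separated from its nearest class?
sim = plain @ plain.T
sim.fill_diagonal_(-1.0)                      # exclude the class itself
nearest = sim.argmax(dim=-1)                  # index of each class's nearest class
type2 = ((plain - plain[nearest]) * (desc - desc[nearest])).sum(dim=-1)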

Results

We compare Spearman's correlation coefficients of the text-only consistency scores with CLIP's zero-shot accuracy:

The text-only consistency scores show significant correlations, but they are lower than Xiaotong's image-text scores. This may be because we use class-wise descriptions instead of more fine-grained image-wise descriptions (class-wise descriptions are clearly suboptimal, since the text descriptions equivalent to individual images naturally differ even within the same species).

Future Work

We plan to improve type1 and type2 scores by using ChatGPT to generate 20,000 image-wise descriptions for the CUB dataset, which we expect will boost the Spearman correlation scores.

Additional Findings

From our previous week's results (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we found that CLIP's zero-shot accuracy correlates with how well it separates embeddings of different classes.

This week, we take one more step and show that CLIP's zero-shot accuracy correlates with both how well it separates embeddings of different classes and how well it compacts embeddings of the same class.

More formally, a type3 text consistency score is:
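The exact definition is given in the figure above and is not reproduced here. One plausible form consistent with "compacting embeddings of the same class" (our reading, not necessarily the exact formula) is the average similarity of a class's descriptive-prompt embeddings to the class "center":

\text{type3}(c) \;=\; \frac{1}{|D_c|} \sum_{d \in D_c} \cos\!\big( T_d,\; T_{\text{plain\_prompt}(c)} \big)

where D_c is a set of descriptive prompts for class c and T_{\text{plain\_prompt}(c)} is the plain-prompt embedding that serves as the center.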

Results

While the type3 text consistency score performs similarly to the type2 score (0.51 vs. 0.50), their combination yields a much stronger consistency score of 0.59 (+0.08). What's next? Perhaps a well-designed text-only method can approach vision-text methods.


Continuing from last week's work (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we want to predict CLIP's zero-shot accuracy by looking only at its text embedding space.

First, let's briefly define the text prompts used in our study:

  • Plain Text: "a photo of a {class_name}"
  • Descriptive Text: "a photo of a {class_name} with {description}"

Ideally, the plain text and descriptive text together form a local space (manifold) corresponding to their class. And in that space, the plain text should serve as the "center."

Then we want to know whether the manifolds of different classes vary due to CLIP's differing understanding of the concepts. Our first visualization suggests that they do:

The plot shows the visualization of CLIP's text embeddings on different classes:

  • Red 'x's are k samples with the lowest accuracy (bottom-k)
  • Green 'o's are k samples with the highest accuracy (top-k)
  • Black '+'s represent the plain text (the center)

All green points look compact, while some red points, especially those at the top left, look visually "skewed".

So the problem now becomes how to quantify the "manifold structure" and connect it to zero-shot accuracy.

The most straightforward method is to use variance to measure compactness. Unfortunately, we can hardly see any correlation between variance and accuracy.
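As a concrete (hypothetical) sketch of this measure: given each class's CLIP text embeddings, take the total variance as a compactness score and correlate it with per-class accuracy. The names feats_by_class and acc_by_class below are placeholders, not the actual code.

import torch
from scipy.stats import spearmanr

def compactness_variance(text_feats: torch.Tensor) -> float:
    # Total variance of one class's (normalized) text embeddings:
    # lower variance = more compact.
    return text_feats.var(dim=0, unbiased=True).sum().item()

# Hypothetical usage: feats_by_class maps class -> (n_prompts, d) tensor of CLIP
# text embeddings; acc_by_class maps class -> zero-shot accuracy.
# variances = [compactness_variance(feats_by_class[c]) for c in classes]
# rho, p = spearmanr(variances, [acc_by_class[c] for c in classes])
# As noted above, this shows essentially no correlation with accuracy.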

By increasing the scope of the previous visualization, we see the reason:

When considering the entire text space, the compactness or skewness of individual classes becomes negligible. Instead, the primary concern shifts to the intersection or confusion between different classes. This issue was highlighted in last week's work (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), where we observed a significant correlation between the distance of a class to its nearest neighboring class and its zero-shot accuracy.

So, what if we continue by using Xiaotong's idea:

Consistency Score = ("name 1" - "name 2") dot ("description 1" - "description 2")

Here, "name 1" is the class we focus on, and "name 2" is the class nearest to it. "Description 1" and "description 2" are detailed descriptions of the two classes generated by ChatGPT, and each quoted string stands for its CLIP text embedding.

We get some positive correlation:

  • Pearson correlation: 0.30
  • Spearman correlation: 0.28

Conclusion

We have investigated the properties of CLIP's text embedding space that affect its zero-shot capacity. Our findings are as follows:

  1. The internal structure ("self-structure") of individual classes has minimal impact.
  2. The relationship between the nearest classes is crucial, aligning with principles from classical machine learning.

The ("name 1" - "name 2") dot ("description 1" - "description 2") idea is promising, and I want to refine some details to make the correlation more significant.

Appendix: What I Tried but Failed

I'm interested in finding the relationship between self-structure and zero-shot capacity, so I tried many experiments.

The methods I tried for computing "self-structure" all failed, including but not limited to:

  • ("a {bird_name} at daytime" - "a {bird_name} at nighttime") dot ("a bird at daytime" - "a bird at nighttime")
  • max_{i!=j} (corpus[i][bird_name] - corpus[j][bird_name]) dot max_{i!=j} (corpus[i][bird] - corpus[j][bird])

where the corpus ≈

[
    f"a photo of a {bird_name} flying over the ocean",
    f"a photo of a {bird_name} perched on a tree branch",
    f"a photo of a colorful {bird_name} in a rainforest",
    f"a photo of a majestic {bird_name} soaring high in the sky",
    f"a photo of a flock of {bird_name}s migrating at sunset",
    f"a photo of a {bird_name} hovering near a flower",
    f"a photo of a {bird_name} waddling on the ice",
    f"a photo of an {bird_name} peeking out from a tree hole",
    f"a photo of a {bird_name} standing in a shallow pond",
    f"a photo of a {bird_name} tapping on a tree trunk",
    f"a photo of a group of {bird_name}s by a lake",
    f"a photo of a {bird_name} feeding its chicks in the nest",
    f"a photo of a {bird_name} fishing in a river",
    f"a photo of a {bird_name} diving into water",
    f"a photo of a {bird_name} with vibrant feathers preening itself",
    f"a photo of a {bird_name} singing at dusk",
]


TL;DR: We want to predict CLIP's zero-shot ability by only seeing its text embedding space. We made two hypotheses:

  1. CLIP’s zero-shot ability is related to its understanding of ornithological domain knowledge, such that the text embedding of a simple prompt (e.g., "a photo of a Heermann Gull") aligns closely with a detailed descriptive prompt of the same bird. (This hypothesis was not supported by our findings)
  2. CLIP’s zero-shot ability is related to how well it separates one class's text embedding from the nearest text embedding of a different class. (This hypothesis showed moderate support)

Hypothesis 1

Motivation

How would a bird expert tell the difference between a California gull and a Heermann's Gull?

A California Gull has a yellow bill with a black ring and red spot, gray back and wings with white underparts, and yellow legs, whereas a Heermann's Gull has a bright red bill with a black tip, dark gray body, and black legs.

Experts utilize domain knowledge/unique appearance characteristics to classify species.

Thus, we hypothesize that, if CLIP's multimodal training gives it the same domain knowledge as experts, the text embedding of "a photo of a Heermann Gull" (denote it as plain_prompt(Heermann Gull)) should be close to the text embedding of "a photo of a bird with Gray body and wings, white head during breeding season plumage, Bright red with black tip bill, Black legs, Medium size. Note that it has a Bright red bill with a black tip, gray body, wings, and white head during the breeding season." (denote it as descriptive_prompt(Heermann Gull)), and vice versa.

For example, the cosine similarity between the two prompts of the Chuck-will's-widow is 0.44 (lowest value across the CUB dataset), and the zero-shot accuracy on this species is precisely 0.

Then, we can formulate our hypothesis as follows

(T_* denotes the text embedding of *):
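The exact formula appears in the figure; as a paraphrase consistent with the cosine-similarity tests below (not a verbatim reproduction, and with ∝ read loosely as "increases with"), the hypothesis is that, for each class c,

\text{ZeroShotAcc}(c) \;\propto\; \cos\!\big( T_{\text{plain\_prompt}(c)},\; T_{\text{descriptive\_prompt}(c)} \big)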

We tested our hypothesis in the CUB dataset.

Qualitative and Quantitative Results

The cosine similarity between "a photo of Yellow breasted Chat" and "a photo of a bird with Olive green back, bright yellow breast plumage" is 0.82, which is the highest value across the whole CUB dataset. However, the zero-shot accuracy on this species is only 10% (the average accuracy is 51%).

We got the Pearson correlation coefficient and the Spearman correlation coefficient between accuracy and the text embedding similarity as follows:

  • Pearson correlation coefficient = -0.14 (p-value: 0.05)
  • Spearman correlation coefficient = -0.14 (p-value: 0.05)

The coefficients suggest only a very weak negative correlation.

We also make a line plot of accuracy vs. text embedding similarity, which shows no meaningful trend (though perhaps zero-shot accuracy tends toward zero when the text embedding similarity is below 0.50):

Thus, we conclude that the hypothesis is not supported.

I think there are possibly two reasons:

  • The lack of correlation might be due to the nature of CLIP's training data, where captions are often not descriptive
  • CLIP does not utilize domain knowledge in the same way humans do

Hypothesis 2

Motivation

We examine the species with nearly zero CLIP accuracy:

The images on the left are the inputs, and the images on the right are from the species most often predicted for those inputs.

We can see that they are close in appearance. Therefore, we wonder if their text embeddings are close as well.

More formally, we want to examine the cosine similarity between one species' text embedding and its nearest text embedding to see if CLIP's inability to distinguish them at the semantic level possibly causes the classification to fail.
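A minimal sketch of this test (illustrative, not the exact code): given L2-normalized CLIP text embeddings emb, one row per species (e.g., from "a photo of a {name}"), and a matching list acc of per-class zero-shot accuracies, correlate each class's similarity to its nearest other class with its accuracy.

import torch
from scipy.stats import pearsonr, spearmanr

def nearest_class_similarity(emb: torch.Tensor) -> torch.Tensor:
    sim = emb @ emb.T              # pairwise cosine similarities (rows are normalized)
    sim.fill_diagonal_(-1.0)       # ignore self-similarity
    return sim.max(dim=-1).values  # each class's similarity to its closest other class

# nearest = nearest_class_similarity(emb)
# print(pearsonr(nearest.tolist(), acc))    # this post reports roughly -0.43
# print(spearmanr(nearest.tolist(), acc))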

Qualitative and Quantitative Results

We also got the Pearson correlation coefficient and the Spearman correlation coefficient:

  • Pearson correlation coefficient = -0.43 (p-value ≈ 1.4e-10)
  • Spearman correlation coefficient = -0.43 (p-value ≈ 1.3e-10)

which suggests a significant but moderate negative correlation.

And a very noisy plot...

Wait, what if we smooth the line plot by averaging every 20 points into one point:
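The smoothing here is just block averaging, roughly like the following sketch (sims and accs are placeholder names for the sorted similarity and accuracy arrays):

import numpy as np

def block_average(values, block=20):
    # Average every `block` consecutive points into one (trailing remainder dropped).
    values = np.asarray(values, dtype=float)
    n = (len(values) // block) * block
    return values[:n].reshape(-1, block).mean(axis=1)

# smoothed_sims, smoothed_accs = block_average(sims), block_average(accs)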

The trend looks clearer, although there is still an "outlier."

In conclusion, I think we can't determine CLIP's zero-shot performance on a class without the information/context of the other classes. For example, CLIP completely fails to classify a California Gull vs. a Heermann's Gull, while it perfectly solves, e.g., a banana vs. a Heermann's Gull.

Next step, I want to investigate:

  1. Are there some special local geometry properties that are related to the zero-shot ability?
  2. When and why does CLIP's zero-shot prediction fail? Is it because the image encoder misses detailed high-resolution features, or because the text encoder fails to encode the most "unique" semantic information? Or maybe we just need a larger model to align the image embedding space and the text embedding space.

TL;DR: We can prompt ChatGPT to generate an "attention map" over its own input (demo available at https://ywugwu.streamlit.app/).

Currently, we're working on getting better prompts via open-source LLMs like Llama3.

Introduction

We're interested in letting LLMs introspect. That is, we let an LLM answer which parts of the input contribute the most to a word of interest in the output text (like an NLP version of Grad-CAM, but via prompting).

We want an NLP version of Grad-CAM (https://arxiv.org/pdf/1610.02391), but via prompting.

We have a demo at https://ywugwu.streamlit.app/ that can do this:

(Figure: demo visualization)

Method

An overview of our prompt: we merge the previous "input text" and "output text" into our prompt and ask the LLM to assign importance scores in a word-to-word fashion.
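As an illustration only (hypothetical wording, not our exact prompt), such a prompt might look like the following template:

# Illustrative prompt template: ask the LLM to score how much each input word
# contributed to a chosen word in its output. Placeholders are filled with
# str.format() before sending the prompt to the LLM.
INTROSPECTION_PROMPT = """\
Input text: {input_text}
Output text: {output_text}
Target word in the output: {target_word}

For each word in the input text, assign an importance score between 0 and 1
indicating how much that word contributed to the target word in the output.
Return the result as a JSON object mapping each input word to its score.
"""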

(Figure: overview of the prompt structure)

We can also use different prompts like:

(Figure: an alternative prompt)

And we can compare the results of these prompts:

(Figures: comparison of results from the different prompts)

Future Work

A future direction (what we're working on now) is using Grad-CAM results as ground truth to optimize our prompt:

(Figure: prompt-optimization pipeline)