
Continuing Revealing CLIP’s Zero-Shot Capacity from Text Embedding Space

Continuing the work from last week (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we want to explain CLIP's zero-shot accuracy by looking only at its text embedding space.

First, let's briefly define the text prompts used in our study:

  • Plain Text: "a photo of a {class_name}"
  • Descriptive Text: "a photo of a {class_name} with {description}"

Ideally, the plain text and descriptive text together form a local space (manifold) corresponding to their class. And in that space, the plain text should serve as the "center."
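To make this concrete, here is a minimal sketch of how the two prompt types can be embedded with CLIP's text encoder. It assumes the OpenAI clip package; the class name and descriptions below are placeholder examples, not the ones used in the experiments.

    import torch
    import clip  # OpenAI CLIP package (assumption: this is the encoder used)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    class_name = "indigo bunting"                                 # placeholder class
    descriptions = ["a deep blue body", "a short conical bill"]   # placeholder ChatGPT-style descriptions

    plain = f"a photo of a {class_name}"
    descriptive = [f"a photo of a {class_name} with {d}" for d in descriptions]

    with torch.no_grad():
        tokens = clip.tokenize([plain] + descriptive).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize, as CLIP does for similarity

    center, manifold = emb[0], emb[1:]   # plain text as the "center", descriptions as the local manifold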

Then, we want to know whether the manifolds of different classes vary because CLIP understands the concepts differently. Our first visualization suggests that they do:

The plot visualizes CLIP's text embeddings for different classes:

  • Red 'x's are k samples with the lowest accuracy (bottom-k)
  • Green 'o's are k samples with the highest accuracy (top-k)
  • Black '+'s represent the plain text (the center)

All green points look compact, while some red points, especially the ones at the top left, are visually "skewed".

So the problem now becomes how to quantify the "manifold structure" and connect it to zero-shot accuracy.

The most straightforward method is to use variance to measure compactness. Unfortunately, we can hardly see any correlation between variance and accuracy.
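Concretely, the variance baseline amounts to something like the sketch below, where classes, per_class_desc_emb, per_class_center, and zero_shot_acc are hypothetical per-class containers (class names, descriptive embeddings, plain-text center, and measured accuracy).

    import numpy as np
    from scipy.stats import pearsonr

    def compactness(desc_emb: np.ndarray, center: np.ndarray) -> float:
        # mean squared distance of a class's descriptive embeddings to its plain-text center
        return float(((desc_emb - center) ** 2).sum(axis=1).mean())

    variances = [compactness(per_class_desc_emb[c], per_class_center[c]) for c in classes]
    accs = [zero_shot_acc[c] for c in classes]
    print(pearsonr(variances, accs))  # in practice this showed no meaningful correlation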

Zooming out the previous visualization shows why:

When considering the entire text space, the compactness or skewness of individual classes becomes negligible. Instead, the primary concern shifts to the intersection or confusion between different classes. This issue was highlighted in last week's work (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), where we observed a significant correlation between the distance of a class to its nearest neighboring class and its zero-shot accuracy.
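As a reminder, that nearest-neighbor measure can be sketched roughly like this, assuming centers is an (n_classes, dim) array of unit-normalized plain-text embeddings:

    import numpy as np

    def nearest_class_distance(centers: np.ndarray) -> np.ndarray:
        # cosine distance from each class's plain-text embedding to the closest other class
        sim = centers @ centers.T        # cosine similarities (rows are unit-normalized)
        np.fill_diagonal(sim, -np.inf)   # ignore self-similarity
        return 1.0 - sim.max(axis=1)     # higher = farther from the nearest neighboring class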

So, what if we instead build on Xiaotong's idea:

Consistency Score = ("name 1" - "name 2") dot ("description 1" - "description 2")

Here, name 1 is the class we focus on, and name 2 is its nearest class. Description 1 and description 2 are detailed descriptions of the two classes generated by ChatGPT.
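In embedding terms, the score can be sketched as follows (a minimal sketch; unit-normalized CLIP text embeddings are assumed, and name 2 is simply the class whose plain-text embedding is most similar to name 1's):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def consistency_score(name1: np.ndarray, name2: np.ndarray,
                          desc1: np.ndarray, desc2: np.ndarray) -> float:
        # ("name 1" - "name 2") dot ("description 1" - "description 2")
        return float(np.dot(name1 - name2, desc1 - desc2))

    # scores / accs would be the per-class consistency scores and zero-shot accuracies:
    # print(pearsonr(scores, accs), spearmanr(scores, accs))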

We get some positive correlation:

  • Pearson correlation: 0.30
  • Spearman correlation: 0.28

Conclusion

We have investigated the properties of CLIP's text embedding space that affect its zero-shot capacity. Our findings are as follows:

  1. The internal structure ("self-structure") of individual classes has minimal impact.
  2. The relationship between the nearest classes is crucial, aligning with principles from classical machine learning.

The ("name 1" - "name 2") dot ("description 1" - "description 2") idea is promising, and I want to refine some details to make the correlation stronger.

Appendix: What I Tried That Failed

I'm interested in finding the relationship between self-structure and zero-shot capacity, so I tried a lot of experiments.

The methods for computing "self-structure" that failed include, but are not limited to, the following (a rough sketch appears after the corpus below):

  • ("a {bird_name} at daytime" - "a {bird_name} at nighttime") dot ("a bird at daytime" - "a bird at nighttime")
  • max_{i!=j} (corpus[i][bird_name] - corpus[j][bird_name]) dot max_{i!=j} (corpus[i][bird] - corpus[j][bird])

where the corpus ≈ [ f"a photo of a {bird_name} flying over the ocean",
        f"a photo of a {bird_name} perched on a tree branch",
        f"a photo of a colorful {bird_name} in a rainforest",
        f"a photo of a majestic {bird_name} soaring high in the sky",
        f"a photo of a flock of {bird_name}s migrating at sunset",
        f"a photo of a {bird_name} hovering near a flower",
        f"a photo of a {bird_name} waddling on the ice",
        f"a photo of an {bird_name} peeking out from a tree hole",
        f"a photo of a {bird_name} standing in a shallow pond",
        f"a photo of a {bird_name} tapping on a tree trunk",
        f"a photo of a group of {bird_name}s by a lake",
        f"a photo of a {bird_name} feeding its chicks in the nest",
        f"a photo of a {bird_name} fishing in a river",
        f"a photo of a {bird_name} diving into water",
        f"a photo of a {bird_name} with vibrant feathers preening itself",
        f"a photo of a {bird_name} singing at dusk" ]

2 thoughts on “Continuing Revealing CLIP’s Zero-Shot Capacity from Text Embedding Space”

  1. Grady McPeak

    Awesome post, Yu! I have two thoughts:

    First, I don't think I understand what the "consistency score" in that one figure is. My guess is that it is some sort of metric determining how close or far from its nearest neighbor a certain class or class description is. Am I on the right track? Could you clarify this for me?

    Second, I think you're right that the connection between your findings here and classical machine learning concepts at least "feels" very real. Even Xiaotong's idea of ("name 1" - "name 2") dot ("description 1" - "description 2") reminds me of the famous result from the Word2Vec paper that, based on the embeddings their model learned, the vectors "king" minus "man" equaled "queen" or something like that. It's a very cool observation you've made that the "older" stuff is still relevant today!

    1. yu.wu1

      Sorry I forgot to mention the consistency score is ("name 1" - "name 2") dot ("description 1" - "description 2").

      I've also heard of that Word2Vec result, and I'm striving to find intuitive, useful rules like it these days. But it seems that, at least for CLIP, the embedding space sometimes behaves differently.

      For example, I plot the similarities between embeddings of man, woman, queen, king, man + queen - king, woman + king - queen at https://github.com/ywugwu/CSCI-6212/blob/main/_posts/imgs2/image.png

      The intuition is that man + queen - king should be close to woman, and woman + king - queen should be close to man. But in fact, man + queen - king is more similar to man than to woman, and woman + king - queen is more similar to woman or king than to man.
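      A minimal sketch of that check, with words as a hypothetical dict holding the four unit-normalized CLIP text embeddings:

          import numpy as np

          composed = words["man"] + words["queen"] - words["king"]
          composed = composed / np.linalg.norm(composed)
          for w in ("man", "woman", "queen", "king"):
              print(w, float(np.dot(composed, words[w])))  # cosine similarity of the composed vector to each word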

