
Is Text as Good as Image? Pseudo Image Embedding and Ensemble of Text Consistency Scores

Abstract

In this blog post, we explore the use of text embeddings to simulate image embeddings and investigate the effectiveness of combining multiple consistency scores to enhance CLIP's zero-shot accuracy.

We find that:

  1. Text embeddings can be used to approximate image embeddings.
  2. Combining various consistency scores by summing them results in a stronger consistency score that correlates well with CLIP's zero-shot accuracy.

Introduction

Xiaotong has developed two consistency scores that strongly correlate with CLIP's zero-shot accuracy. This study aims to determine whether these scores can be approximated by replacing image embeddings with descriptive text embeddings, and whether combining multiple consistency scores yields a stronger one.

In Xiaotong's equations, I_i represents the image embedding of image i, and T_i represents the text embedding of text i. We hypothesize that an image seen by CLIP can be described equivalently by text.

For example, an image of a bird can be described as: "a small bird perched on a branch. The bird has a yellowish breast and throat, with a mix of gray and olive-green plumage on its back and wings..."

Intuitively, the descriptive text embedding can replace the image embedding in Xiaotong's equation, and the resulting text-only consistency score may still correlate strongly with CLIP's zero-shot accuracy.
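To make this concrete, below is a minimal sketch of the pseudo image embedding idea: the class description is passed through CLIP's text encoder and the resulting vector stands in for the image embedding. The open_clip model name and the description string are illustrative choices, not the exact setup behind the scores in this post.

import torch
import open_clip

# Load a CLIP model; "ViT-B-32" / "openai" are assumptions, not the study's exact checkpoint.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

description = ("a small bird perched on a branch. The bird has a yellowish "
               "breast and throat, with a mix of gray and olive-green plumage "
               "on its back and wings")

with torch.no_grad():
    tokens = tokenizer([description])
    pseudo_image_embedding = model.encode_text(tokens)
    pseudo_image_embedding /= pseudo_image_embedding.norm(dim=-1, keepdim=True)

# pseudo_image_embedding now plays the role of I_i wherever a consistency
# score expects an image embedding.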

Methodology

We propose two types of text consistency scores corresponding to Xiaotong's equations, with the descriptive texts generated by ChatGPT:

bird_appearances = {
    "Black footed Albatross": "large seabird, North Pacific Ocean, wingspan 6.5 to 7.5 feet, dark plumage, black or dark brown, pale or white areas around beak, pale or white areas under eyes, black beak, black feet",
    "Laysan Albatross": "large seabird, wingspan up to 7 feet, white head and body, dark upper wings and back, dark eye patch, pink beak, pink feet",
    "Sooty Albatross": "large seabird, sooty brown to black plumage, long slender wings, white crescent above and behind the eye, dark bill, dark feet",
...}

Then, we formulate the type1 and type2 consistency scores as follows:

We use the type1 text consistency score to correspond to Xiaotong's first equation and the type2 text consistency score to correspond to Xiaotong's second equation.
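The original equations are not reproduced in this post as text, so the sketch below is purely illustrative: the function bodies are our own assumptions, not Xiaotong's published definitions. Roughly, a type1-style score measures how strongly each class's description embedding matches its own class-name embedding, while a type2-style score measures the margin between the matching class and the hardest non-matching class.

import torch
import torch.nn.functional as F

def type1_score(desc_emb, name_emb):
    # desc_emb, name_emb: [C, d] description and class-name text embeddings.
    # Illustrative type1: mean cosine similarity between each class's
    # description embedding and its own class-name embedding.
    desc_emb = F.normalize(desc_emb, dim=-1)
    name_emb = F.normalize(name_emb, dim=-1)
    return (desc_emb * name_emb).sum(dim=-1).mean().item()

def type2_score(desc_emb, name_emb):
    # Illustrative type2: average margin between the matching class and the
    # hardest non-matching class, i.e. how separable the classes are in text space.
    sims = F.normalize(desc_emb, dim=-1) @ F.normalize(name_emb, dim=-1).T
    own = sims.diag()
    others = sims.clone()
    others.fill_diagonal_(float("-inf"))
    return (own - others.max(dim=-1).values).mean().item()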

Results

We compare Spearman's correlation coefficients of the text-only consistency scores with CLIP's zero-shot accuracy:

The text-only consistency scores show significant correlations, but they are lower than Xiaotong's image-text scores. This may be because we use class-wise descriptions rather than more fine-grained image-wise descriptions: class-wise descriptions are clearly suboptimal, since the equivalent text descriptions of individual images naturally differ even within the same species.
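For completeness, the correlation itself is Spearman's rank coefficient over (consistency score, zero-shot accuracy) pairs; a minimal sketch with placeholder numbers (not the reported values) looks like:

from scipy.stats import spearmanr

# One (consistency score, zero-shot accuracy) pair per evaluation unit;
# the values here are placeholders, not the actual results.
consistency_scores = [0.31, 0.42, 0.28, 0.55, 0.47]
zero_shot_accuracies = [0.52, 0.61, 0.49, 0.70, 0.66]

rho, p_value = spearmanr(consistency_scores, zero_shot_accuracies)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")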

Future Work

We plan to improve type1 and type2 scores by using ChatGPT to generate 20,000 image-wise descriptions for the CUB dataset, which we expect will boost the Spearman correlation scores.

Additional Findings

From our previous week's results (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we found that CLIP's zero-shot accuracy correlates with how well it separates embeddings of different classes.

This week, we take this one step further: CLIP's zero-shot accuracy correlates with both how well it separates embeddings of different classes and how well it compacts embeddings of the same class.

More formally, a type3 text consistency score is:
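The exact formula is not reproduced here as text; as a hedged sketch of one form consistent with the description above (within-class compactness minus between-class similarity, assuming several text embeddings per class, e.g. multiple prompt or description variants), it could look like:

import torch
import torch.nn.functional as F

def type3_score(class_embs):
    # class_embs: list of [n_i, d] tensors, several text embeddings per class.
    # Illustrative type3: within-class compactness minus between-class similarity,
    # so higher values mean tighter classes that sit farther apart.
    means = torch.stack([F.normalize(e, dim=-1).mean(dim=0) for e in class_embs])
    means = F.normalize(means, dim=-1)

    # Compactness: average similarity of each embedding to its own class mean.
    compact = torch.cat([
        F.normalize(e, dim=-1) @ means[i] for i, e in enumerate(class_embs)
    ]).mean()

    # Separation: average similarity between different class means (lower is better).
    sims = means @ means.T
    off_diag = sims[~torch.eye(len(class_embs), dtype=torch.bool)]
    return (compact - off_diag.mean()).item()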

Results

While the type3 text consistency score is similar to the type2 score (0.51 vs 0.50), their combination yields a much stronger consistency score of 0.59 (+0.08). What's next? Maybe a well-designed text-only method that can approach vision-language methods.
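Since the combination is a simple sum of the individual scores, the ensemble step itself is a one-liner; a minimal sketch with placeholder numbers:

import numpy as np
from scipy.stats import spearmanr

# Placeholder per-unit scores; real values come from the type2/type3 definitions above.
type2 = np.array([0.30, 0.45, 0.27, 0.52, 0.48])
type3 = np.array([0.33, 0.41, 0.30, 0.55, 0.44])
zero_shot_acc = np.array([0.52, 0.61, 0.49, 0.70, 0.66])

combined = type2 + type3  # ensemble by summing, as described in the abstract
rho, _ = spearmanr(combined, zero_shot_acc)
print(f"Spearman rho of the combined score = {rho:.2f}")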

2 thoughts on “Is Text as Good as Image? Pseudo Image Embedding and Ensemble of Text Consistency Scores”

  1. pless

    Hi! Great post. I'm actually more optimistic than I think you are about the text things. A question about this: when you compute "Xiaotong's score", that is based on comparing the image embeddings with the text embeddings in various ways.

    Are you using the actual CUB images to compute that score? Even using "training CUB images" to compute the score, and testing CUB images to compute the accuracy is using real images from the same distribution.

    I think the fair test is when you use the stable-diffusion generated images to create the images used to generate that score, and I think the correlation is much lower then.

Can you clarify where the "xiaotong score" comes from? If it is helpful, you can ask Xiaotong/Kevin for the created images or how to make them.

    1. yu.wu1

      I used the results from Xiaotong's paper "Will CLIP Zero-Shot?: Predicting Zero-Shot Classification Performance Without Any Labelled Data".

In his paper, he also reports the consistency scores on generated images, which are 0.67 and 0.72. So, yes, the scores are much lower.

