
The easiest way to run this code is via this Colab notebook on an A100 or L4, since it must generate many images depending on how long the description is: https://colab.research.google.com/drive/1zsAB8jZ9uInkBw5szUDoNWtOijip7PZt?usp=sharing for v1

and

https://colab.research.google.com/drive/1dBOPvO8Fo3FYIjpgOPVaifuBb52LXf-Q?usp=sharing for v2

Don't forget to add the base image hotel.png and to update the edit instructions, or to turn on generative atomic edits from gen AI.


Motivation and Introduction

The motivation for this project comes from my work with the Traffickcam project, which has a large database of crowd-sourced hotel images that the Department of Defense can use to identify locations of human trafficking. The envisioned scenario is that a human trafficking victim could describe their surroundings in their own words, to be recorded as text. Using the text description of the hotel room they were in and a similar base image, can AI tools such as stable diffusion be used to replace the existing setting of the image with visual features that match the text description? This requires a multimodal model like CLIP in order to evaluate text and images in the same embedding space. The hope and hypothesis of the project was that, if it worked, the resulting transformed image might even be used to retrieve the actual hotel image from the database, should it exist. If not, it could at least serve as a visual aid to help identify the real location of the hotel room.

This project breaks a long text description down into edit targets and prompts, using naive backtracking to decide the order in which to make these edits. Backtracking is used because the image changes dynamically as edits are applied, so the order of edits affects the final outcome. If one prompt replaces a chair with a desk and another prompt is "the desk had flowers on it", the order in which those edits happen matters. Color changes can also bleed into the global color palette, so the order of a color edit affects the whole image as well. For all these reasons and more, the user will not know the optimal order of edits to yield the best picture, so this algorithm backtracks to find the edit sequence that produces the best final image.

Methods

The initial method for editing the images was to use the img2img method of stable diffusion models. This technique involves starting with a base image (the original base hotel image) and adding a bit of noise to that image rather than starting from noise alone. Then during the denoising process of stable diffusion, the features of the original image will start to resemble the desired features as described in the prompt by pushing the image's CLIP embedding closer to the prompt's CLIP embedding.
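For concreteness, a single img2img edit step might look like the following sketch using the diffusers library; the checkpoint, strength, and guidance values here are illustrative assumptions, not the exact settings from the notebook.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Minimal img2img sketch; checkpoint and parameter values are assumptions.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

base = Image.open("hotel.png").convert("RGB").resize((512, 512))
edited = pipe(
    prompt="the bed had blue pillows",  # one atomic edit from the description
    image=base,
    strength=0.75,                      # how much noise is added to the base image
    guidance_scale=7.5,                 # how strongly denoising follows the prompt
).images[0]
edited.save("hotel_edit_1.png")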

Based on the text description T of the room's features, atomic edits are generated. All of these atomic edits together make up the edit space E. The state St consists of the embedding of the current image It and an ordered set Et of the edits completed so far, where each edit e ∈ E is one of the atomic edits.

St = (It, Et)

The action consists of selecting a new edit to update state St. This selection is either random or based on a heuristic that helps decide which edits should be made first. After each action, a new image embedding It is generated and the list of completed edits Et is updated.

Actions = members of E not in Et

π(e | St) = 1 / |members of E not in Et|  (uniform random), or π(e | St) ∝ h(e)  (heuristic)

No more actions are allowed in a search once all edits are completed, so each run makes |E| edits. Edit paths are selected using naive backtracking, and since the output of each step is a new image (or the information in its embedding), the state is fully observable. The state St transitions, upon selection of a new edit, to a new picture and edit list (It+1, Et+1) using a stable diffusion model with img2img capabilities.

It+1 = f(It, et) where f represents the img2img process

Et+1 = Et + et

The base image for the img2img operation is the current image It, and the text prompt et is the selected edit, for example "the bed had blue pillows". An observation score rt can be computed as the cosine similarity between the current image embedding and the embedding of the full text description T, denoted Te.

rt+1 = cos(It+1, Te)

If an edit ever causes the cosine similarity between It and Te to go down, or if the result of an edit has low cosine similarity to the edit prompt itself, then that branch is not working well and can be pruned and backtracked. Each edit should push the similarity score higher as more features of the description are matched.

Prune if rt < rt-1

Prune if rt < threshold
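Putting the transition and the two pruning rules together, the search itself is a short recursive loop. The sketch below is a simplification: apply_edit stands in for the img2img (or inpainting) call, and clip_similarity / embed_text stand in for CLIP embedding and cosine similarity; none of these names come from the actual notebook.

def search(image, remaining, full_desc_emb, prev_score, threshold=0.2):
    """Naive backtracking over edit orderings with the two pruning rules above."""
    best = (prev_score, image)
    for e in remaining:                                 # try each unused edit next
        candidate = apply_edit(image, e)                # img2img/inpainting with prompt e
        r = clip_similarity(candidate, full_desc_emb)   # rt+1 = cos(It+1, Te)
        edit_fit = clip_similarity(candidate, embed_text(e))
        if r < prev_score or edit_fit < threshold:
            continue                                    # prune this branch and backtrack
        sub = search(candidate, [x for x in remaining if x != e],
                     full_desc_emb, r, threshold)
        if sub[0] > best[0]:
            best = sub
    return best                                         # (score, image) of the best sequence found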

Cleaner Images and Edits with Inpainting

The final version of this project uses inpainting and Grounding DINO in order to have more fine-grained control over how the image changes. Grounding DINO performs open-set, zero-shot object detection, which means a bounding box can be produced for any object in an image given a text prompt. This bounding box can be used as a mask for a stable diffusion inpainting model, localizing image edits to a particular location. The outcome is that if a prompt says "the bed sheets were red", Grounding DINO can isolate the location of the bed so that the "red" or "bed sheet" parts of the prompt do not permeate the entire image. The idea is localized edits that still make sense in the context of the whole image, which is what the inpainting model allows us to do better than plain stable diffusion.

The complicated part of this was that some descriptions include a desired feature, like a couch in the corner, that is not present in the original image. This led to the introduction of edit prompts and edit targets: the edit prompts are the prompts given to the inpainting model, and the edit targets are the prompts given to Grounding DINO in order to get a reasonable bounding box for the target. This method allowed for more minute changes, since they could sometimes be localized to a very small box in the original image.
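A single edit-target/edit-prompt step can be sketched with the Hugging Face ports of Grounding DINO and a Stable Diffusion inpainting model. The checkpoints, thresholds, and the rectangle-as-mask shortcut below are my assumptions for illustration, not the exact notebook code.

import torch
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from diffusers import StableDiffusionInpaintPipeline

device = "cuda"
dino_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
dino = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-base").to(device)
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)

def localized_edit(image, edit_target, edit_prompt):
    # 1) Grounding DINO: open-set detection of the edit target.
    #    Grounding DINO expects lowercase queries ending with a period, e.g. "a bed."
    inputs = dino_proc(images=image, text=edit_target, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = dino(**inputs)
    det = dino_proc.post_process_grounded_object_detection(
        outputs, inputs.input_ids, box_threshold=0.3, text_threshold=0.25,
        target_sizes=[image.size[::-1]])[0]
    box = det["boxes"][det["scores"].argmax()].tolist()   # assumes at least one detection
    # 2) Turn the box into a binary mask for the inpainting model.
    mask = Image.new("L", image.size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)
    # 3) Inpaint only inside the mask, conditioned on the edit prompt.
    return inpaint(prompt=edit_prompt, image=image, mask_image=mask).images[0]

# e.g. localized_edit(base_image, "a bed.", "the bed sheets were red")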

Results

Initial results were not good, and a great deal of time went into optimizing the hyperparameters of the stable diffusion model. The key parameters were the denoising strength, which controls how far the output can drift from the base image, and the guidance scale, which controls how strongly the generation follows the text prompt. These initial results eventually yielded reasonable pictures, but they never got the edits quite right and almost always made the changes more globally than desired, since the edits were often meant for localized areas of the room.

Img2Img Sequence

While using the img2img technique, a strength level between 0.6 and 0.9 was assigned to each edit. This strength level influenced how much the image would vary from the original. The intent was that some edits require a larger color or object change, so more noise needs to be added to the original. There was also an intuition that the larger the edit, the earlier it should be made, since the global color and feature changes of the big edits could be fixed by the smaller ones; strength level therefore became a heuristic for prioritizing edits. Remaining edits were then ordered by strength level, but this did not lead to much better performance: the sequence would usually still be similar, and the edit quality was still relatively poor and unfocused.

Circular lights everywhere.
Red sheets everywhere.

Grounding DINO

The biggest issue with the img2img technique was that features like colors or themes would permeate the entire image even when guided strictly by the text prompt. Saying "the bed had red sheets" will turn every pillow red, as well as the curtains and other textures, depending on the denoising strength. The solution required targeted edits in order to properly encode the information into the image. Recent developments in the stable diffusion world have led to inpainting models that use a mask to generate a new patch in a specific region of an image and blend it into the rest of the picture seamlessly. These models are not perfect, but they work far better than img2img for making specific edits to an image.

Mask for bed
Inpainting edit for bed

The only drawback to the inpainting method was that each individual edit required a mask to condition which pixels are in the desired region. To address this, Grounding DINO was used to do zero-shot object detection on the target location for each edit. Grounding DINO is capable of open-set object detection, meaning it can, in principle, identify any vague edit target in a hotel room. Once the box and score thresholds were tuned, it proposed reasonable bounding boxes for most edits given targets like 'ceiling', 'the chair in the corner', or 'bed'. These masks, combined with the inpainting model, produced more consistent and better-looking edits. They also scored slightly higher on all cosine similarity metrics, though not to a significant degree.

Mask for ceiling
Inpainting edit for ceiling

Metrics

The cosine similarity metrics for each run compared the following: the original image to the atomic edit description embeddings, the final edited image to the atomic edit description embeddings, and a proxy for the real described hotel room to the atomic edit description embeddings. Inpainting generally produced higher similarities for the final edited picture. The metric used to decide whether to backtrack on an edit was the cosine similarity between the edited image and that individual edit's embedding.

Cosine similarity between the final edited image and the retrieval proxy image was generally high, around 0.7 to 0.8, and highest with the inpainting method at around 0.9. This, however, is less impressive than it sounds, because all of the edited images also retain about 0.8 to 0.9 cosine similarity with the original hotel image, so this metric may not be the most accurate way to measure the edit task.

Retrieval Proxy

Conclusion

The goal of the experiment was to break down text data and encode it into an image, both to make an image that better represents the target location and, potentially, to use that image to retrieve a picture of the target location from a dataset of images. Though the pipeline is capable of delivering specific edits to pictures systematically and without supervision, the quality of the resulting embeddings is not such that they can be used for retrieval, as many hotel rooms score very highly against one another. Instead, there is more hope in focusing on the embeddings of the atomic edits. The atomic edits can be decomposed into a list of features that should be present in the image. They may not all have high cosine similarity scores, but comparing the average of the individual features with an image seems more effective than generating a new image for direct image-to-image retrieval. That approach, of course, has its own limitation: how high can the cosine similarity to all the features be if each feature pulls the embedding in a different direction? Perhaps a picture is not worth a thousand words when their embeddings have the same dimensionality.

Inpainting with Grounding DINO Sequence


There is a big difference between what the GradCAM attention visualization looks like for the last hidden non-pooling layer of SigLIP 2's base version and its large version (more specifically, the Huggingface builds "google/siglip2-base-patch16-256" and "google/siglip2-large-patch16-256"). At first, I was just working with the base version, and the performance was bad enough to make me think that there was something wrong with my code. It turns out, though, the same code that produces dubious results with the base version produces great ones with the large version! Let's compare.

Below is a small set of examples of SigLIP trying to perform zero-shot classification on pictures of dogs, with the 5 classes being "This is a photo of a [X]," with [X] being "cat," "dog," "car," "person," and "shoe." For each example I'll show the attention map for the base version on the left, the large version on the right, and the value of the "dog" class for each version in the caption below.

base: .2314 large: .3736
base: .2240 large: .4878
base: .2227 large: .3386
base: .2857 large: .6008

In each of the four cases, the value assigned by the large version to the "dog" class is higher, indicating that the model has a higher confidence in that class. In fact, for the first three examples, "dog" is not even the highest-rated class by the base model, meaning that in the first three cases we are seeing a misclassification by the base model, and a correct classification by the large model. In the fourth case, both the base and the large model got it right.
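For reference, per-class scores like the ones in the captions can be produced with the two checkpoints named above. The sketch below uses the paired sigmoid that SigLIP is trained with; whether my figures used exactly this scoring is an implementation detail, so treat this as an approximation.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

def class_scores(image_path, checkpoint):
    """Zero-shot scores for the five prompts; SigLIP uses a per-class sigmoid, not a softmax."""
    model = AutoModel.from_pretrained(checkpoint).eval()
    processor = AutoProcessor.from_pretrained(checkpoint)
    labels = ["cat", "dog", "car", "person", "shoe"]
    texts = [f"This is a photo of a {x}." for x in labels]
    image = Image.open(image_path).convert("RGB")
    # SigLIP text towers expect fixed-length padding (64 tokens).
    inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]
    return dict(zip(labels, torch.sigmoid(logits).tolist()))

# class_scores("dog1.jpg", "google/siglip2-base-patch16-256")
# class_scores("dog1.jpg", "google/siglip2-large-patch16-256")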

It looks to me like the thing that the base model's attention maps have in common in the first three examples is that the attention is very "unfocused." As in, it is all over the place as opposed to actually focused on the dog in the image. Going forward, I'll be searching for literature that discusses something along the lines of "attention concentration," or, some sort of metric for how "focused" an activation map like this is on the pertinent features of the image. In other words, I wonder if I can use something like SegmentAnything to isolate the dog in each image, and then see how much of the attention, as a percentage of the total, is focused on the dog, and how much is focused outside of it.
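As a first cut at such a metric, the fraction of the total attention mass that falls inside a segmentation mask of the dog is easy to compute once both are arrays of the same spatial size (the SegmentAnything step is assumed to have already produced the binary mask):

import numpy as np

def attention_inside_mask(attn_map, mask):
    """Fraction of total attention mass that falls inside a binary object mask.

    attn_map: (H, W) non-negative attention / Grad-CAM map, resized to the image.
    mask:     (H, W) binary mask of the object (e.g., from SegmentAnything).
    """
    attn = np.asarray(attn_map, dtype=np.float64)
    inside = np.asarray(mask, dtype=bool)
    total = attn.sum()
    return float(attn[inside].sum() / total) if total > 0 else 0.0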

Suggestions for quick home repair of heating.

Prompt: It's very cold outside and...
Specific cases: the heat is out, the heat the power and water are all out

GPT input: "It's very cold outside and there is no heat.  What do you suggest"
GPT Response: "quote their answer"

Risk Analysis: --- The answer above "....." is much riskier than it might appear because...
What you've learned --- the GPT-4o model is much ... than the 3.5 model I last used for this exercise...

Problem Statement

Previously, we made two great datasets, CUB-2019 and CUB-2023. However, the zero-shot accuracies using CLIP on these datasets did not match the performance on the original CUB. One possible reason is that CLIP may have overfitted on the original CUB dataset. As a result, the zero-shot performance, or "Validation Accuracy", on our CUB-2019 and CUB-2023 is lower compared to the "Training Accuracy". To validate this thought, I decided to examine CLIP's training set in more detail.

Approach

The LAION-400M Dataset

CLIP's training dataset has not been publicly released, but it is known to be derived from publicly available sources across the Internet. Open-CLIP, an open-source implementation of CLIP, is trained on datasets like LAION-400M and LAION-2B, both of which originate from Common Crawl, the largest open web-crawl data source. I selected the model trained on LAION-400M, as LAION-400M is similar in size to CLIP's training set, WebImageText, and it shows similar performance to CLIP on the three CUB datasets:

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy | 56.35 | 45.06 | 44.08
Top-5 Accuracy | 86.31 | 76.82 | 75.19
Open-CLIP's performance on the three datasets

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy | 51.86 | 41.51 | 40.89
Top-5 Accuracy | 82.86 | 73.74 | 72.43
CLIP's performance on the three datasets

Finding the Subsets

The dataset consists of 413 million text-image pairs, so tracing 12,000 images feels like searching for needles in the ocean. To speed up the process, I chose to filter with specific keywords. Luckily, the dataset's metadata provides a sample id, url, and text description for each image, which allows me to query images by caption or domain.
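A minimal sketch of that keyword filtering, assuming the LAION-400M metadata parquet shards have already been downloaded locally (the column names follow the released metadata; the paths and the short species list are placeholders):

import glob
import pandas as pd

shards = glob.glob("laion400m-meta/*.parquet")          # placeholder path to the metadata shards
species = ["Black footed Albatross", "Laysan Albatross", "Sooty Albatross"]  # ... all 200 CUB classes

with pd.ExcelWriter("laion_cub_candidates.xlsx") as writer:
    for name in species:
        matches = []
        for shard in shards:
            df = pd.read_parquet(shard, columns=["SAMPLE_ID", "URL", "TEXT"])
            hit = df[df["TEXT"].str.contains(name, case=False, na=False)]
            if len(hit):
                matches.append(hit)
        if matches:
            # One sheet per species, mirroring the spreadsheet layout described below.
            pd.concat(matches).to_excel(writer, sheet_name=name[:31], index=False)

A domain query (e.g. "www.inaturalist.org") works the same way, filtering on the URL column instead of TEXT.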

The iNaturalist Subset

The CUB-2019 and CUB-2023 datasets are derived from inaturalist.org. However, a query on "www.inaturalist.org" returns only 540 results, most of which are non-bird data (insects, fish, plants, etc.). Therefore, CUB-2019 and CUB-2023 are clearly not part of CLIP's training set.

The CUB classes subset

The CUB dataset does not provide image URLs, and popular reverse image search tools (TinEye and Google reverse image search) do not give satisfactory results in finding the image sources. Therefore, I changed my strategy to creating subsets of all possible images corresponding to the 200 bird species listed in the CUB dataset.

I then made 200 queries, one for each of the 200 classes in the CUB dataset, to find images whose captions mention the class name. All the matching results are stored in an Excel file, with one sheet for each species.

Some images, although they mention a bird species in the paired text, actually show non-bird objects (phone cases, tote bags, hoodies, or postcards). I performed a preliminary filtering pass to remove these images.

Image Hash

Perceptual hashing is robust to minor modifications in images, such as cropping and slight changes in brightness or contrast, which makes it very effective for identifying duplicate or near-duplicate images.

Hashing images from the CUB dataset is straightforward since all the images are available. I simply compute the hash values for each image and save the information in an Excel file. This file also contains 200 sheets, each corresponding to one of the 200 species, making it easier to compare images of the same species.

The LAION-400M official website provides instructions for accessing the images using img2dataset, so I could download the images in batches and then compute their perceptual hashes. The hash values are saved back to the spreadsheet.

Image Matching

Both the spreadsheets of CUB and LAION-400M are structured as follows:

File name/ File ID | File Path | Hash Value

I can then measure any two images' similarity by computing the Hamming distance between their hash values (the number of bits in which the two hashes differ).
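Both the hashing and the matching steps are a few lines with the imagehash library (the folder paths below are placeholders):

import glob
import imagehash
import pandas as pd
from PIL import Image

def hash_folder(folder):
    """Perceptual hash (pHash) for every image in a folder."""
    rows = [{"file": path, "hash": str(imagehash.phash(Image.open(path)))}
            for path in glob.glob(f"{folder}/*.jpg")]
    return pd.DataFrame(rows)

cub = hash_folder("CUB_200_2011/images/001.Black_footed_Albatross")    # placeholder paths
laion = hash_folder("laion_downloads/Black_footed_Albatross")

# Hamming distance = number of differing bits between the two 64-bit hashes.
d = imagehash.hex_to_hash(cub.loc[0, "hash"]) - imagehash.hex_to_hash(laion.loc[0, "hash"])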

Since each species has its own sheet, I can compare hash values only for images within the same class. This approach avoids the issue of matching images from different species.

Thresholding and results

Setting the threshold is challenging. With a threshold of 10, we identified 166 pairs; with a threshold of 12, we found 541 pairs. There are 1,234 image pairs with a Hamming distance of less than 15. Even with low distances, some images appear relatively different or have significantly different content.

In the following examples, the columns show the index difference, the CUB image, and the LAION-400M image.

Concerns

The Text Captions

Some are concerned that images in the CUB dataset might also exist in LAION-400M but with different captions, such as "photo of a yellow bird". Querying only with species names in my initial step might therefore miss a significant number of images.

To address this, I reviewed the CUB-200 paper. The article mentions using species names as query terms on Flickr and employing Amazon Mechanical Turk for image filtering and annotation. The process did not involve expert bird identification. Therefore, I believe using the same species names for querying is reasonable.

Unavailable URLs

While using img2dataset to download relevant images from LAION-400M, some downloads failed because the dataset stores images only as URLs and some URLs are no longer available. This indicates that there are images in the training set that are no longer accessible for tracing.

Sample report from the img2dataset program, showing the failed-to-download image count and error log.

The Near-Duplicates

While examining the results, I have observed images that look very similar but not identical, for example, photos of the same species taken from the same angle.

Although they are not identical, their similarity raises a question: Could such photos be indicative of overtraining by CLIP? The answer is not yet known, but I plan to explore this in my future research.

Since we've been able to reproduce the results of "Do LLMs Know About Hallucination?", we have begun to ask whether the authors' technique can be applied to more realistic content. The datasets in that paper are composed almost entirely of questions whose answers are single-sentence or single-word factoids. In reality, however, LLM answers can be quite wordy. The question now is: can we find the needles (confabulations) in this haystack (a paragraph)?

To test this, I took inspiration from their lists of the top 10 tokens associated with the directions of correctness and hallucination (Table 1), and I attempted to apply it to our new task. Specifically, I tracked how close the transition vectors were to the word "True" (based on the model's token classification layer) over the course of the generation of the answer.
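A minimal sketch of one way to implement this tracking with an open model from Hugging Face. The model choice and the exact definition of the score (here, the cosine similarity between each generated token's final hidden state and the unembedding row for "True") are my assumptions, not necessarily the authors' exact recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"   # assumed model choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto").eval()

# Direction of the token "True" in the output (unembedding) space.
true_id = tok.encode("True", add_special_tokens=False)[0]
true_dir = model.get_output_embeddings().weight[true_id].detach().float().cpu()
true_dir = true_dir / true_dir.norm()

def true_similarity(prompt, answer):
    """Cosine similarity of each answer token's final hidden state with the 'True' direction."""
    enc = tok(prompt + answer, return_tensors="pt").to(model.device)
    n_prompt = len(tok(prompt)["input_ids"])
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[-1][0].float().cpu()
    h = hidden[n_prompt:]                 # hidden states of the answer tokens only
    h = h / h.norm(dim=-1, keepdim=True)
    return (h @ true_dir).tolist()        # one value per answer token, plotted over the response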

The dataset used is a subset of 20 SOAP notes from our EHA dataset, along with each note's respective AI-generated response. So, our "question" for each datapoint in this set is, "I want you to act as an attending physician and give feedback on the SOAP note below," followed by the note. The "answers" are each note's AI-generated feedback. However, for each note, I've manually changed the feedback in the medication section to say, "I think you prescribe Benadryl." Benadryl was not originally prescribed in any of the unedited feedback, so in each case Benadryl is an inappropriate medication to prescribe.

Above is a graph showing the "swings" in the classification value for "True" at the token where the word "Benadryl" has been generated (in red) and at the word right before it (in blue). So, each blue token is a word that is not in any way actually wrong, followed by a red dot representing a medication that is clearly inappropriate for that patient. Each blue-red pair of tokens comes from the same response. This visualization makes a compelling case for a clear "swing" away from True in the model's embedding space when a wrong word has been generated.

However, when looking at a longer slice of each response, it becomes clear that the perceived downward swing in the first chart is not actually real. Especially looking at the dark blue line representing the average, the expected "drop-off" at the end of the graph does not occur. This means that sheer similarity to "True" is not enough for us to detect a single-word confabulation. However, this is just one possible approach, and going forward I will be investigating more.


TL;DR: We tested our previous text-only consistency score on two more datasets: Stanford Dogs and Flower102. We compared it to Xiaotong's vision-text score, which is computed based on generated images.

It worked very well and robustly (hopefully, there are no bugs in my code)!

Work in Progress: Visualization

Recall that our text-only score intuitively evaluates the degree to which the CLIP model understands a concept and how well it separates the target class from neighboring classes.

I want to create some visualizations to see what the classes with the lowest text consistency scores can tell us:

In the following figures, I plot the images of the class with the lowest text consistency score and the class they are most likely to be predicted as, in a left vs. right manner. Additionally, I indicate their accuracy and text consistency score above each pair.

Ideally, we want to determine why a class is misclassified: is it because the CLIP model doesn't understand the concept of the class, or because it can't distinguish the class from its neighboring classes? Or maybe both?

Example of Lack of Knowledge

Let's start with a patch of the plot for the Flower102 dataset (the complete high-resolution plot is available at: https://github.com/ywugwu/CSCI-6212/blob/main/_posts/imgs2/Flower102_lowest_type_1_plus_2_plus_3_text_consistency_score.png?raw=true):

We plot the images of the class that has the lowest text consistency score vs. the images of the class that CLIP most often predicts them as:

The two flowers do not look similar, yet CLIP most often predicts the left red flowers as the right purple flowers. Intuitively, CLIP understands the right purple flower quite well, as it has a high text consistency score and accuracy, and the red flower does not even look similar to it. Seemingly, the left red flower is misclassified because CLIP can't understand what it is.

Example of Appearance Confusion (Need More Scores to Support)

Next, let's look at an example from the Stanford Dogs dataset: these two dogs are, in my opinion, misclassified because they are so similar in appearance. We can also see that their text consistency scores and accuracies are similar.

Our consistency score is formed by summing different text consistency scores: at a minimum, we have a semantic understanding score and a class confusion score. I think I should also report the separate text consistency scores, as they could explain a lot.

The first one.

But wait! There's more! We may be able to learn something about the way LLMs work based on 2nd place to last place.

I'll back up and say a little more about what exactly I mean when I ask, "Which sentence receives the most attention?" In the paper we presented at ICHI, we remarked on how in some cases it seemed like GPT was "paying lip service" to certain aspects of our prompt, and not actually changing its answers to match concerns we would raise about cost, time, etc.

We'd insert into our prompt something to the effect of, "These tests can be very expensive, so do not recommend ones that are not relevant to the patient's acute condition." Then GPT would respond with the same test recommendations it gave for our "base" prompt, but it would throw in a sentence about being careful not to recommend too many tests. This suggests that, for the sentence with the actual recommendations, it was paying a lot of attention to the list of symptoms, and for the sentence after it (where it would warn against over-recommending tests), it was paying a lot of attention to the warning in our prompt.

This raises the question of whether or not we can actually observe such a pattern at the sentence level. What would an attention map, like the one below, look like grouped into sentences?

In order to investigate this question, we unfortunately have to look at a model that isn't actually in the GPT-4 family, since access to those models' attention values is not available. Additionally, I wanted to build a proof-of-concept that I could run on my own computer, so I picked the 7-billion-parameter version of LLaMA 2 available on HuggingFace.
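For reference, the token-level attention matrix below can be pulled out of the HuggingFace LLaMA 2 checkpoint like this (averaging over all layers and heads is my aggregation choice; the exact aggregation behind my figures may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
# "eager" attention so that the attention weights are actually returned.
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto", attn_implementation="eager"
).eval()

def attention_matrix(text):
    """(seq_len, seq_len) attention averaged over all layers and heads, plus the token strings."""
    enc = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    attn = torch.stack([a[0].float().cpu() for a in out.attentions])   # (layers, heads, seq, seq)
    return attn.mean(dim=(0, 1)), tok.convert_ids_to_tokens(enc["input_ids"][0])

# attn, tokens = attention_matrix(input_text + " " + output_text)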

An attention matrix for the (defined below) input-output pair's processing through LLaMA 2. Here, the x and y axes are the tokens of the input-output pair, going from the first token at what is the (0, 0) point on most other graphs, to the last token. Here, the color scale is 0-1. This will change on other figures.

Above we can see the attention matrix for the following input and output to LLaMA 2:
Input: Tell me a fun fact about the Holy Roman Empire. Then, tell me a fun fact about nuclear fusion. Then, tell me a fun fact about the TV show Friends. Please tell me one fact for each.
Output: Fun fact about the Holy Roman Empire: The Holy Roman Empire was the first multinational state in European history. It was a political entity that existed from 962 to 1806 and included territories in modern-day Germany, Austria, Switzerland, Belgium, the Netherlands, and parts of France. Fun fact about nuclear fusion: Nuclear fusion is the process by which atomic nuclei combine to form a heavier nucleus, releasing energy in the process. It is the same process that powers the sun and other stars. Scientists have been working on developing a practical fusion reactor for decades, but it remains an elusive goal. Fun fact about the TV show Friends: The iconic Central Perk coffee shop from the show was actually filmed in a real coffee shop in Los Angeles called "The Coffee Bean & Tea Leaf." The show's creators chose this location because they wanted the coffee shop to have a cozy, inviting atmosphere that would feel like a second home to the characters.

This weird prompt was chosen as a "toy" example so that we would have a response conducive to seeing a shifting focus, from fact #1 to fact #3. However, at least here at the token level, the main thing we see instead is that a lot of attention is being paid to the first token in the prompt: the start token, "<s>." In fact, this level of attention paid to "<s>" is so great that, when we combine the tokens into sentences and look at how the attention is distributed over these sentences, that is the primary feature we see.

A sentence-level attention matrix. Here, the color scale is 0-0.09, a significant decrease from the previous figure.
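The grouping into sentences is just a block aggregation over token spans: rows (query tokens) are averaged and columns (key tokens) are summed, so each row shows how one sentence distributes its attention over the others. A sketch, assuming sentence boundaries have already been mapped to token index ranges:

import numpy as np

def sentence_level(attn, spans):
    """Collapse a (seq, seq) token attention matrix into a (n_sent, n_sent) matrix.

    spans: list of (start, end) token index ranges, one per sentence.
    """
    attn = np.asarray(attn)
    n = len(spans)
    out = np.zeros((n, n))
    for i, (qs, qe) in enumerate(spans):
        for j, (ks, ke) in enumerate(spans):
            # total attention each query token in sentence i puts on sentence j,
            # averaged over the query tokens in sentence i
            out[i, j] = attn[qs:qe, ks:ke].sum(axis=1).mean()
    return out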

Interestingly, there is a spike in attention on the sentence, "Fun fact ... history." However, this is the first sentence of the response, and even though special tokens are not shown on the axes of this graph, it also starts with the special token "<s>." Since that really skews the levels of attention, let's look at these sentences without the special tokens.

A "cleaner" version of the sentence-level attention matrix. Here, the color scale is 0-0.025. So, any changes represented here are very small.

A greater level of attention is still being given to the first sentence of the output, even when not counting the special tokens. Outside of this observation, it's hard to tell to what degree the model is placing "extra" attention on what humans would perceive to be the "most relevant" sentence of the prompt. I had hoped for much more dramatic results that might suggest that the model pays a whole bunch of attention to the most relevant sentence and almost none to the others. Perhaps a larger experiment would show a slightly statistically significant difference that would be more definitive. But I don't think I'm going to go for it, mainly because I anticipate that, if the experiment did show that LLMs pay slightly more attention to more relevant sentences, I'd be left with a lingering, "So what?"


Abstract

In this blog post, we explore the use of text embeddings to simulate image embeddings and investigate the effectiveness of combining multiple consistency scores to enhance CLIP's zero-shot accuracy.

We find that:

  1. Text embeddings can be used to approximate image embeddings.
  2. Combining various consistency scores by summing them results in a stronger consistency score that correlates well with CLIP's zero-shot accuracy.

Introduction

Xiaotong has developed two consistency scores that strongly correlate with CLIP's zero-shot accuracy. This study aims to determine whether these scores can be approximated by replacing image embeddings with descriptive text embeddings and by combining multiple consistency scores.

Here, I_i represents the image embedding of image i, and T_i represents the text embedding of text i. We hypothesize that an image seen by CLIP can be described equivalently by text.

For example, the figure can be depicted as: "a small bird perched on a branch. The bird has a yellowish breast and throat, with a mix of gray and olive-green plumage on its back and wings..."

Intuitively, a descriptive text embedding can replace the image embedding in Xiaotong's equation, and the new text-only consistency score may still be strongly correlated with CLIP's zero-shot accuracy.

Methodology

We propose two types of text consistency scores corresponding to Xiaotong's equations, in which the descriptive texts are generated by ChatGPT:

bird_appearances = {
    "Black footed Albatross": "large seabird, North Pacific Ocean, wingspan 6.5 to 7.5 feet, dark plumage, black or dark brown, pale or white areas around beak, pale or white areas under eyes, black beak, black feet",
    "Laysan Albatross": "large seabird, wingspan up to 7 feet, white head and body, dark upper wings and back, dark eye patch, pink beak, pink feet",
    "Sooty Albatross": "large seabird, sooty brown to black plumage, long slender wings, white crescent above and behind the eye, dark bill, dark feet",
...}

Then, we formulate type1 & type2 consistency scores as below:

We use the type1 text consistency score to correspond to Xiaotong's first equation and the type2 text consistency score to correspond to Xiaotong's second equation.
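The type1 and type2 formulas themselves are shown in the figures, but both rely on the same ingredient: CLIP text embeddings for the plain class prompt and for its descriptive counterpart. A minimal sketch of computing those (the checkpoint and the exact prompt wording are assumptions):

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()   # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_embed(texts):
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**tokens)
    return emb / emb.norm(dim=-1, keepdim=True)   # unit-normalized, so dot products are cosines

plain = text_embed([f"a photo of a {name}" for name in bird_appearances])
descriptive = text_embed([f"a photo of a {name} with {desc}" for name, desc in bird_appearances.items()])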

Results

We compare Spearman's correlation coefficients of the text-only consistency scores with CLIP's zero-shot accuracy:

The text-only consistency scores show significant correlations but are lower than Xiaotong's image-text scores. This may be due to using class-wise descriptions instead of more fine-grained image-wise descriptions (Using class-wise descriptions is obviously suboptimal since images' equivalent text descriptions are naturally different even if they belong to the same species).

Future Work

We plan to improve type1 and type2 scores by using ChatGPT to generate 20,000 image-wise descriptions for the CUB dataset, which we expect will boost the Spearman correlation scores.

Additional Findings

From our previous week's results (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we found that CLIP's zero-shot accuracy correlates with how well it separates embeddings of different classes.

This week, we take one more step and say that CLIP's zero-shot accuracy correlates with how well it separates embeddings of different classes and how well it compacts embeddings of the same class.

More formally, a type3 text consistency score is:

Results

While the type3 text consistency score is similar to the type2 score (0.51 vs 0.50), their combination yields a much stronger consistency score of 0.59 (+0.08). What's next? Maybe a well-designed text-only method that can approach vision-language methods.


Problem Statement

Over-represented datasets in model training data can lead to biases, causing models to perform well on these specific datasets but poorly on others. For instance, the CUB datasets might be over-represented in CLIP's training data.

My last week's post shared the results of CLIP zero-shot classification for CUB, CUB-2019, and CUB-2023: https://blogs.gwu.edu/pless/2024/06/10/comparative-study-of-clips-zero-shot-accuracy-on-cub-dataset/

This week, I want to find out which factors make our accuracy 10% lower than on CUB-200-2011. Is this discrepancy due to CUB being over-represented in CLIP's training data, or are there issues with our datasets?

Experiments

Exp1: Confusion Matrices

To evaluate CLIP's performance on CUB, CUB-2019, and CUB-2023, I plotted the following matrices showing the accuracy of predicting the correct classes. Top-5 predictions are included in the matrices.

  • All three matrices show a clear diagonal line, indicating that CLIP correctly predicts the majority of the images in the top-5 results.
  • Misclassifications (the non-zero values scattered off the diagonals) are dispersed throughout the matrices, but cluster in certain areas. This suggests that CLIP tends to confuse certain classes, especially those that are visually similar, such as different types of sparrows and warblers. The clusters are common across all three matrices.
  • The color scale on the right indicates the frequency of predictions; brighter colors represent higher frequencies. The brightness of the three matrices is in the order CUB > CUB-2019 > CUB-2023, though to me, CUB and CUB-2019 exhibit similar brightness in this plot despite their numerical results.

Exp2: Minor Adjustments

My initial guess was that the original CUB, containing 5,994 training images and 5,794 testing images, is twice as large as our CUB-2019 or CUB-2023 datasets, which have only about as many images as the testing set. To eliminate the effect of dataset size, I used only the testing set of CUB. It turns out the overall performance remains the same.

I came across the templates for zero-shot classification in CLIP's official GitHub repository and found that the prompt for BirdSnap is 'a photo of a {class_name}, a type of bird.' Thinking it would be helpful for bird-specific datasets, I adopted this prompt and got about a 1% improvement for each dataset.

Dataset | CUB | CUB-2019 | CUB-2023
Top-1 Accuracy | 53.18 | 42.29 | 41.58
Top-5 Accuracy | 84.04 | 74.61 | 73.84
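For reference, a minimal sketch of the zero-shot classification with the BirdSnap-style prompt (the CLIP checkpoint and the short class list are placeholders):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()   # placeholder checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["Black footed Albatross", "Laysan Albatross", "Sooty Albatross"]  # ... all 200 classes
prompts = [f"a photo of a {c}, a type of bird." for c in class_names]

def top_k(image_path, k=5):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image[0].softmax(dim=-1)
    top = probs.topk(min(k, len(prompts)))
    return [(class_names[i], round(p, 4)) for i, p in zip(top.indices.tolist(), top.values.tolist())]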

Exp3: show stats and examples of images

To further investigate the results shown by the confusion matrices, I decided to examine the images themselves. I produced two sets of images:

  1. Bar charts displaying difference in accuracies for classes between two datasets;
    1. CUB vs CUB 2019;
    2. CUB vs CUB 2023;
    3. CUB 2019 vs CUB 2023;
  2. Large composite images containing all images per class for each dataset.

There are 8 classes in the original CUB dataset that are not present in our CUB-2019 and CUB-2023 datasets due to insufficient images. Therefore, we exclude the first 8 bars from each chart.

(The charts are extremely long and get blurry when I upload them, so I decided to share the images through Google Drive.)

I have not done a comprehensive analysis of all the charts, but simply looking at the corresponding composite images for the top categories with a lower score in CUB-2019 or CUB-2023 provides some insights.

Indigo Bunting in CUB-200-2011
Indigo Bunting in CUB-2019
Indigo Bunting in CUB-2023

Let's look at another example:

Purple Finch in CUB-200-2011
Purple Finch in CUB-2019
Purple Finch in CUB-2023

Both the Purple Finch and the Indigo Bunting are bird species with very distinct appearances between male and female birds. In the original CUB dataset, the majority of the images are of male birds, which have more prominent features that make them easier to classify. However, our datasets include more images of female and immature birds, resulting in nearly a 50% reduction in accuracy for these classes compared to the original CUB.
Aside from the female and immature birds, our datasets also include photos taken from different angles, in dark environments, and with distracting backgrounds, all of which increase the difficulty of classification.

Exp4: Change class names (inspired by Abby)

Although CLIP's officially posted prompts do not include hyphens in bird names, I personally think class names like "Black-footed Albatross" are more common in the real world and would therefore yield better results than "Black footed Albatross". So I conducted experiments by adding the hyphens back to the class names.

Text input | CUB w/ hyphen | CUB w/o hyphen
Top-1 | 53.49 | 53.19
Top-5 | 84.54 | 84.04

However, adding the hyphen results in only a slight overall improvement. A closer look at the per-class results reveals that some classes did see significant improvements, but not all classes with a format similar to "Black-footed Albatross" improved as expected; some even experienced a decrease in accuracy. My current understanding of CLIP does not explain this result. If anyone has insights into this, please feel free to share in the comments.

Conclusion

At the end of the day, I cannot make a definitive conclusion like "CUB is over-represented in CLIP's training data." However, I am confident that our datasets have their strengths: we feature more female birds, immature birds, and photos captured from various angles. Moving forward, I am planning to conduct a comparative study between CUB-2019 and CUB-2023 and to analyze the images in a different, more quantitative manner.


Continuing from the work of last week (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), we want to explain CLIP's zero-shot accuracy by looking only at its text space.

First, let's briefly define the text prompts used in our study:

  • Plain Text: "a photo of a {class_name}"
  • Descriptive Text: "a photo of a {class_name} with {description}"

Ideally, the plain text and descriptive text together form a local space (manifold) corresponding to their class. And in that space, the plain text should serve as the "center."

Then, we want to know whether the manifolds of different classes vary due to CLIP's differing understanding of the concepts. Our first visualization suggests that they do:

The plot shows the visualization of CLIP's text embeddings on different classes:

  • Red 'x's are k samples with the lowest accuracy (bottom-k)
  • Green 'o's are k samples with the highest accuracy (top-k)
  • Black '+'s represent the plain text (the center)

All the green points look compact, while some red points, especially the top-left ones, are visually "skewed".

So the problem now becomes how to quantify the "manifold structure" and connect it to zero-shot accuracy.

The most straightforward method is to use variance to depict compactness. Unfortunately, we can hardly tell any correlation between the variance and accuracy.

By increasing the scope of the previous visualization, we see the reason:

When considering the entire text space, the compactness or skewness of individual classes becomes negligible. Instead, the primary concern shifts to the intersection or confusion between different classes. This issue was highlighted in last week's work (https://blogs.gwu.edu/pless/2024/06/06/what-does-clip-read-towards-understanding-clips-text-embedding-space/), where we observed a significant correlation between the distance of a class to its nearest neighboring class and its zero-shot accuracy.

So, what if we continue by using Xiaotong's idea:

Consistency Score = ("name 1" - "name 2") dot ("description 1" - "description 2")

Here, name 1 is the class we focus on, and name 2 is the nearest class to it. Description 1 and description 2 are detailed descriptions of the two classes generated by ChatGPT.
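A minimal sketch of this score with CLIP's text encoder (the checkpoint and the plain-prompt wording are assumptions; the descriptions are the ChatGPT-generated ones):

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()   # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**tokens)
    return emb / emb.norm(dim=-1, keepdim=True)

def consistency_score(name1, name2, desc1, desc2):
    """("name 1" - "name 2") dot ("description 1" - "description 2")."""
    n1, n2 = embed([f"a photo of a {name1}", f"a photo of a {name2}"])
    d1, d2 = embed([desc1, desc2])
    return torch.dot(n1 - n2, d1 - d2).item()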

We get some positive correlation:

  • Pearson correlation: 0.30
  • Spearman correlation: 0.28

Conclusion

We have investigated the properties of CLIP's text embedding space that affect its zero-shot capacity. Our findings are as follows:

  1. The internal structure ("self-structure") of individual classes has minimal impact.
  2. The relationship between the nearest classes is crucial, aligning with principles from classical machine learning.

The idea ("name 1" - "name 2") dot ("description 1" - "description 2") is promising and I want to optimize some details to make the correlation more significant

Appendix: What I Tried but Failed

I'm interested in finding the relationship between self-structure and zero-shot capacity, so I tried a lot of experiments.

The methods I tried for computing "self-structure" all failed, including but not limited to:

  • ("a {bird_name} at daytime" - "a {bird_name} at nighttime") dot ("a bird at daytime" - "a bird at nighttime")
  • max_{i!=j} (corpus[i][bird_name] - corpus[j][bird_name]) dot max_{i!=j} (corpus[i][bird] - corpus[j][bird])

where the corpus ≈ [
        f"a photo of a {bird_name} flying over the ocean",
        f"a photo of a {bird_name} perched on a tree branch",
        f"a photo of a colorful {bird_name} in a rainforest",
        f"a photo of a majestic {bird_name} soaring high in the sky",
        f"a photo of a flock of {bird_name}s migrating at sunset",
        f"a photo of a {bird_name} hovering near a flower",
        f"a photo of a {bird_name} waddling on the ice",
        f"a photo of an {bird_name} peeking out from a tree hole",
        f"a photo of a {bird_name} standing in a shallow pond",
        f"a photo of a {bird_name} tapping on a tree trunk",
        f"a photo of a group of {bird_name}s by a lake",
        f"a photo of a {bird_name} feeding its chicks in the nest",
        f"a photo of a {bird_name} fishing in a river",
        f"a photo of a {bird_name} diving into water",
        f"a photo of a {bird_name} with vibrant feathers preening itself",
        f"a photo of a {bird_name} singing at dusk"]