TL;DR: We want to predict CLIP's zero-shot classification ability from its text embedding space alone. We made two hypotheses:
- CLIP’s zero-shot ability is related to its understanding of ornithological domain knowledge, i.e., how closely the text embedding of a simple prompt (e.g., "a photo of a Heermann Gull") aligns with the text embedding of a detailed descriptive prompt for the same bird. (This hypothesis was not supported by our findings.)
- CLIP’s zero-shot ability is related to how well it separates one class's text embedding from the nearest text embedding of a different class. (This hypothesis showed moderate support)
Hypothesis 1
Motivation
How would a bird expert tell the difference between a California gull and a Heermann's Gull?
A California Gull has a yellow bill with a black ring and red spot, gray back and wings with white underparts, and yellow legs, whereas a Heermann's Gull has a bright red bill with a black tip, dark gray body, and black legs.
Experts rely on domain knowledge, i.e., distinctive appearance characteristics, to classify species.
Thus, we hypothesize that, if CLIP's multimodal training gives it the same domain knowledge that experts use, the text embedding of "a photo of a Heermann Gull" (let's denote it as plain_prompt(Heermann Gull)) should be close to the text embedding of "a photo of a bird with Gray body and wings, white head during breeding season plumage, Bright red with black tip bill, Black legs, Medium size. Note that it has a Bright red bill with a black tip, gray body, wings, and white head during the breeding season." (let's denote it as descriptive_prompt(Heermann Gull)), and vice versa.
For example, the cosine similarity between the two prompts of the Chuck-will's-widow is 0.44 (lowest value across the CUB dataset), and the zero-shot accuracy on this species is precisely 0.
Then, we can formulate our hypothesis as follows:
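Roughly, writing $E_{\text{text}}$ for CLIP's text encoder, $\mathrm{acc}(c)$ for the zero-shot accuracy on class $c$, and $s(c)$ for the prompt similarity (this notation is ours):

$$
s(c) \;=\; \cos\!\big(E_{\text{text}}(\texttt{plain\_prompt}(c)),\; E_{\text{text}}(\texttt{descriptive\_prompt}(c))\big),
\qquad \text{Hypothesis 1:}\ \operatorname{corr}_c\big(\mathrm{acc}(c),\, s(c)\big) > 0.
$$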
We tested our hypothesis on the CUB dataset.
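As a minimal sketch, the prompt similarity for one species can be computed as below. This assumes the openai `clip` package and a ViT-B/32 checkpoint; the exact model and prompt construction used in the experiments may differ.

```python
# Minimal sketch: cosine similarity between a plain and a descriptive prompt.
# Assumes the openai `clip` package (https://github.com/openai/CLIP) and ViT-B/32.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

plain_prompt = "a photo of a Heermann Gull"
descriptive_prompt = (
    "a photo of a bird with Gray body and wings, white head during breeding "
    "season plumage, Bright red with black tip bill, Black legs, Medium size. "
    "Note that it has a Bright red bill with a black tip, gray body, wings, "
    "and white head during the breeding season."
)

with torch.no_grad():
    # truncate=True guards against descriptive prompts exceeding CLIP's 77-token limit
    tokens = clip.tokenize([plain_prompt, descriptive_prompt], truncate=True).to(device)
    emb = model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize

cos_sim = (emb[0] @ emb[1]).item()               # cosine similarity of the two prompts
print(f"cosine similarity: {cos_sim:.2f}")
```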
Qualitative and Quantitative Results
The cosine similarity between "a photo of Yellow breasted Chat" and "a photo of a bird with Olive green back, bright yellow breast plumage" is 0.82, the highest value across the whole CUB dataset. However, the zero-shot accuracy on this species is only 10% (the average accuracy across species is 51%).
We got the Pearson correlation coefficient and the Spearman correlation coefficient between accuracy and the text embedding similarity as follows:
- Pearson correlation coefficient = -0.14 (p-value = 0.05)
- Spearman correlation coefficient = -0.14 (p-value = 0.05)
Both coefficients suggest, at best, a very weak negative correlation.
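For reference, here is a minimal sketch of the correlation test using `scipy.stats`; the toy arrays below just stand in for the real per-class values (the 0.44/0.82 similarities and 0%/10% accuracies are the examples quoted above, the rest are made up).

```python
# Minimal sketch: Pearson and Spearman correlation between per-class
# zero-shot accuracy and prompt similarity (toy stand-in values).
import numpy as np
from scipy.stats import pearsonr, spearmanr

sims = np.array([0.44, 0.61, 0.70, 0.82])  # per-class cosine similarities (toy)
accs = np.array([0.00, 0.55, 0.62, 0.10])  # per-class zero-shot accuracies (toy)

pearson_r, pearson_p = pearsonr(sims, accs)
spearman_r, spearman_p = spearmanr(sims, accs)
print(f"Pearson  r = {pearson_r:.2f} (p = {pearson_p:.2g})")
print(f"Spearman r = {spearman_r:.2f} (p = {spearman_p:.2g})")
```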
We also made a line plot of accuracy vs. text embedding similarity, which shows no meaningful trend (at most, we can say that zero-shot accuracy tends toward zero when the text embedding similarity is below 0.50):
Thus, we conclude that the hypothesis is not supported.
I think there are two possible reasons:
- The lack of correlation might be due to the nature of CLIP's training data, where captions are often not descriptive
- CLIP does not utilize domain knowledge in the same way humans do
Hypothesis 2
Motivation
We examine the species with nearly zero CLIP accuracy:
We can see that they are close in appearance. Therefore, we wonder if their text embeddings are close as well.
More formally, we want to examine the cosine similarity between each species' text embedding and the nearest text embedding of a different species, to see whether CLIP's inability to separate them at the semantic level is what causes classification to fail.
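A minimal sketch of this nearest-class similarity is below, assuming `class_embs` holds unit-normalized CLIP text embeddings for all classes (the variable names and the random stand-in embeddings are illustrative only).

```python
# Minimal sketch: for each class, cosine similarity to the closest *other*
# class's text embedding. Random vectors stand in for the real CLIP embeddings.
import torch

def nearest_text_similarity(class_embs: torch.Tensor) -> torch.Tensor:
    """Return the nearest-neighbor cosine similarity for each class, excluding itself."""
    sim = class_embs @ class_embs.T        # pairwise cosine similarities (N x N)
    sim.fill_diagonal_(float("-inf"))      # exclude the class itself
    return sim.max(dim=1).values           # nearest-class similarity per class

# Example: 200 CUB classes embedded in a 512-d CLIP text space (stand-in values)
class_embs = torch.nn.functional.normalize(torch.randn(200, 512), dim=-1)
nn_sims = nearest_text_similarity(class_embs)
```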
Qualitative and Quantitative Results
We again computed the Pearson and Spearman correlation coefficients, this time between per-class accuracy and the nearest-class text embedding similarity:
- Pearson correlation coefficient = -0.43 (p-value ≈ 1.4e-10)
- Spearman correlation coefficient = -0.43 (p-value ≈ 1.3e-10)
which suggests a statistically significant but moderate negative correlation.
And we get a very noisy plot:
Wait, what if we smooth the line plot by averaging every 20 points into one:
The trend looks clearer, although there is still an "outlier."
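The smoothing itself is just a bin average over the points sorted by similarity. A minimal sketch (assuming `sims` and `accs` are the per-class arrays of nearest-class similarity and zero-shot accuracy; the names and the bin size of 20 are illustrative):

```python
# Minimal sketch: sort points by similarity and average every 20 into one.
import numpy as np

def bin_average(x: np.ndarray, y: np.ndarray, bin_size: int = 20):
    """Sort points by x and average every `bin_size` consecutive points into one."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = (len(x) // bin_size) * bin_size   # drop the leftover tail
    return (x[:n].reshape(-1, bin_size).mean(axis=1),
            y[:n].reshape(-1, bin_size).mean(axis=1))

# smoothed_sims, smoothed_accs = bin_average(sims, accs)  # then plot one against the other
```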
In conclusion, I think we can't determine whether CLIP's zero-shot prediction will succeed without knowing the other classes it has to distinguish. For example, CLIP completely fails to classify a California Gull vs. a Heermann's Gull, while it perfectly solves, e.g., a banana vs. a Heermann's Gull.
As a next step, I want to investigate:
- Are there local geometric properties of the text embedding space that are related to zero-shot ability?
- When and why does CLIP's zero-shot prediction fail? Is it because the image encoder misses detailed high-resolution features, because the text encoder fails to encode the most "unique" semantic information, or simply because we need a larger model to align the image and text embedding spaces?