TL;DR: We tested our previous text-only consistency score on two more datasets, Stanford Dogs and Flower102, and compared it with Xiaotong's vision-text score, which is computed from generated images.
It worked very well and robustly (hopefully, there are no bugs in my code)!
Work in Progress: Visualization
Recall that our text-only score intuitively evaluates the degree to which the CLIP model understands a concept and how well it separates the target class from neighboring classes.
I want to create some visualizations to see what the classes with the lowest text consistency scores can tell us:
In the following figures, I plot images from the class with the lowest text consistency score next to images from the class it is most often predicted as, in a left vs. right manner. Above each pair, I indicate the accuracy and text consistency score of both classes.
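For illustration, here is a minimal sketch of how such a figure could be laid out (the `load_class_images` helper and the `accuracy` / `text_score` dictionaries are hypothetical placeholders, not my actual code):

```python
# Illustrative sketch only: images of the lowest-scoring class on the left,
# images of the class CLIP most often predicts it as on the right.
# `load_class_images`, `accuracy`, and `text_score` are hypothetical placeholders.
import matplotlib.pyplot as plt

def plot_confusion_pair(low_class, predicted_class, accuracy, text_score, n_images=4):
    fig, axes = plt.subplots(1, 2 * n_images, figsize=(6 * n_images, 3))
    # left half: the class with the lowest text consistency score
    for ax, img in zip(axes[:n_images], load_class_images(low_class, n_images)):
        ax.imshow(img)
        ax.axis("off")
    # right half: the class it is most often predicted as
    for ax, img in zip(axes[n_images:], load_class_images(predicted_class, n_images)):
        ax.imshow(img)
        ax.axis("off")
    fig.suptitle(
        f"{low_class}: acc={accuracy[low_class]:.2f}, score={text_score[low_class]:.2f}  |  "
        f"{predicted_class}: acc={accuracy[predicted_class]:.2f}, score={text_score[predicted_class]:.2f}"
    )
    plt.show()
```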
Ideally, we want to determine why a class is misclassified: is it because the CLIP model doesn't understand the concept of the class, or because it can't distinguish the class from its neighboring classes? Or maybe both?
Example of Lack of Knowledge
Let's start with a cropped patch of the plot for the Flower102 dataset (the complete high-resolution plot is available at: https://github.com/ywugwu/CSCI-6212/blob/main/_posts/imgs2/Flower102_lowest_type_1_plus_2_plus_3_text_consistency_score.png?raw=true):
We plot images from the class with the lowest text consistency score (left) vs. images from the class that CLIP most often predicts them as (right):
The two flowers do not look similar, yet CLIP most often predicts the red flower on the left as the purple flower on the right. CLIP seems to understand the purple flower quite well: it has both a high text consistency score and high accuracy. Since the red flower doesn't actually resemble the purple flower, it appears to be misclassified not because the two classes are visually confusable, but because CLIP doesn't understand what it is.
Example of Appearance Confusion (Needs More Scores to Support)
Next, let's look at an example from the Stanford Dogs dataset: these two dogs are, in my opinion, misclassified because they are so similar in appearance. We can also see that their text consistency scores and accuracies are similar.
Our consistency score is computed by summing several component text scores: at minimum, a semantic understanding score and a class confusion score. I should also report these components separately, as they could explain a lot.
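To make the breakdown concrete, here is a minimal sketch of reporting the two components alongside their sum (the `semantic_scores` / `confusion_scores` dictionaries are placeholders for the two components, not the exact variables in my code):

```python
# Sketch: report the two components next to their sum for a given class.
# `semantic_scores` and `confusion_scores` are hypothetical per-class dictionaries.
def text_consistency_breakdown(class_name, semantic_scores, confusion_scores):
    s_sem = semantic_scores[class_name]    # how well CLIP understands the concept
    s_conf = confusion_scores[class_name]  # how well CLIP separates it from neighbors
    return {"semantic": s_sem, "confusion": s_conf, "total": s_sem + s_conf}
```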
Questions:
1. Have you thought about possible ways to combine the text/image predictions?
2. Are there classes where one is clearly better than the other?
3. Can finding cases where one is clearly better suggest whether the problem lies in CLIP's understanding or in the dataset?
Great post!
For Q1, I think a simple weighted average of the image prediction and the text prediction could help.
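A minimal sketch of that idea, assuming we already have per-class probability vectors from the text-based and image-based predictions (the names below are placeholders):

```python
import numpy as np

# Blend the two prediction distributions; alpha weights the text-based side.
def combine_predictions(text_probs, image_probs, alpha=0.5):
    return alpha * np.asarray(text_probs) + (1.0 - alpha) * np.asarray(image_probs)

# Example: pick the class with the highest blended probability.
# pred = combine_predictions(text_probs, image_probs, alpha=0.6).argmax(axis=-1)
```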
For Q2, the answer is "no": my text score vs. accuracy plot (https://raw.githubusercontent.com/ywugwu/CSCI-6212/main/_posts/imgs/image.png) is very similar to Xiaotong's image-based plot, and I can't see any points that stand out.
For Q3, my answer is "I don't know, but probably more in the dataset." My text score is computed from how well CLIP understands the concepts plus how well CLIP can separate the classes in the dataset, and it turns out the latter contributes more to the correlation with accuracy. I'll organize my code and visualizations and publish a new post to discuss this soon.
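Roughly, the comparison behind that claim amounts to correlating each score component with per-class accuracy, as sketched below (the array names are placeholders; the actual analysis will be in the follow-up post):

```python
from scipy.stats import pearsonr

# Hypothetical per-class arrays, aligned by class index:
#   semantic_scores    - concept-understanding component
#   confusion_scores   - class-separation component
#   per_class_accuracy - CLIP zero-shot accuracy per class
r_sem, _ = pearsonr(semantic_scores, per_class_accuracy)
r_conf, _ = pearsonr(confusion_scores, per_class_accuracy)
print(f"semantic understanding vs. accuracy: r={r_sem:.3f}")
print(f"class confusion vs. accuracy:        r={r_conf:.3f}")
```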