TL;DR: We tested our previous text-only consistency score on two more datasets, Stanford Dogs and Flower102, and compared it to Xiaotong's vision-text score, which is computed from generated images.
It worked well and robustly on both (hopefully there are no bugs in my code)!
Work in Progress: Visualization
Recall that our text-only score intuitively evaluates the degree to which the CLIP model understands a concept and how well it separates the target class from neighboring classes.
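To make that intuition concrete, here is a minimal sketch of how such a score could be computed from CLIP's text encoder alone, assuming the openai/CLIP package. The prompt templates, the mean pairwise similarity, the 1 − max-neighbor-similarity margin, and the plain sum are all illustrative placeholders, not our actual formula:

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def text_embeddings(texts):
    # Encode prompts and L2-normalize so dot products are cosine similarities.
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    return F.normalize(emb.float(), dim=-1)

def text_consistency(class_name, class_names):
    # Semantic understanding (placeholder): do different phrasings of the
    # same concept land close together in CLIP's text space?
    paraphrases = [
        f"a photo of a {class_name}",
        f"a close-up photo of a {class_name}",
        f"an image of a {class_name}",
    ]
    para_emb = text_embeddings(paraphrases)
    understanding = (para_emb @ para_emb.T).mean().item()

    # Class confusion (placeholder): margin between the class prompt and
    # the most similar neighboring class prompt; a small margin means the
    # two classes are hard to separate in text space.
    all_emb = text_embeddings([f"a photo of a {c}" for c in class_names])
    i = class_names.index(class_name)
    sims = all_emb @ all_emb[i]
    sims[i] = -1.0  # exclude self-similarity
    separation = 1.0 - sims.max().item()

    return understanding + separation  # placeholder combination
```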
I want to create some visualizations to see what the classes with the lowest text consistency scores can tell us:
In the following figures, I plot, side by side, images of the class with the lowest text consistency score (left) and images of the class they are most often predicted as (right). Above each pair, I also indicate each class's accuracy and text consistency score.
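A minimal matplotlib sketch of such a figure might look like the following; `left_imgs`/`right_imgs` and the title strings are stand-ins for the actual images, accuracies, and scores from our pipeline:

```python
import matplotlib.pyplot as plt

def plot_confusion_pair(left_imgs, right_imgs, left_title, right_title,
                        n_rows=3):
    # Left column: the class with the lowest text consistency score.
    # Right column: the class its images are most often predicted as.
    fig, axes = plt.subplots(n_rows, 2, figsize=(6, 3 * n_rows), squeeze=False)
    axes[0, 0].set_title(left_title, fontsize=9)
    axes[0, 1].set_title(right_title, fontsize=9)
    for r in range(n_rows):
        axes[r, 0].imshow(left_imgs[r])
        axes[r, 1].imshow(right_imgs[r])
        axes[r, 0].axis("off")
        axes[r, 1].axis("off")
    fig.tight_layout()
    return fig

# Titles carry each class's accuracy and text consistency score, e.g.:
# plot_confusion_pair(imgs_a, imgs_b,
#                     "class A  acc=0.10  score=0.32",
#                     "class B  acc=0.93  score=0.88")
```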
Ideally, we want to determine why a class is misclassified: is it because the CLIP model doesn't understand the concept of the class, or because it can't distinguish the class from its neighboring classes? Or maybe both?
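Purely as a hypothetical reading rule for these two failure modes (the 0.5 threshold and the score names are placeholders, not our calibrated values):

```python
def diagnose(understanding, confusion_margin, threshold=0.5):
    # Map the two (placeholder) component scores to the two failure modes;
    # both can fire at once, matching the "maybe both" case.
    reasons = []
    if understanding < threshold:
        reasons.append("lack of knowledge: CLIP's grasp of the concept is weak")
    if confusion_margin < threshold:
        reasons.append("class confusion: a neighboring class sits too close")
    return reasons or ["text scores look fine; the error needs another explanation"]
```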
Example of Lack of Knowledge
Let's start with a cropped portion of the plot for the Flower102 dataset (the complete high-resolution plot is available at: https://github.com/ywugwu/CSCI-6212/blob/main/_posts/imgs2/Flower102_lowest_type_1_plus_2_plus_3_text_consistency_score.png?raw=true):
We plot the images of the class with the lowest text consistency score (left) vs. the images of the class that CLIP most often predicts them as (right):
The two flowers do not look alike, yet CLIP most often predicts the left red flowers as the right purple flowers. CLIP seems to understand the right purple flower quite well: it has a high text consistency score and high accuracy. Since the two classes are visually dissimilar, appearance confusion is an unlikely explanation; the left red flower seems to be misclassified because CLIP doesn't understand what it is.
Example of Appearance Confusion (Needs More Scores to Support It)
Next, let's look at an example from the Stanford Dogs dataset: these two dogs are, in my opinion, misclassified because they are so similar in appearance. We can also see that their text consistency scores and accuracies are similar.
Our consistency score is a sum of several component text consistency scores: at minimum, a semantic understanding score and a class confusion score. I should also indicate the separate component scores in the figures, as they could explain a lot; a sketch of how they might appear in the panel titles follows.
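As a small, hypothetical illustration, the helper below folds named component scores into a panel title; the component names and value ranges are placeholders, not our final definitions:

```python
def panel_title(class_name, acc, components):
    """Format a title like: 'class A  acc=0.12  understanding=0.31  confusion=0.18  total=0.49'.

    `components` maps a component-score name to its value, e.g.
    {"understanding": 0.31, "confusion": 0.18} (placeholder names).
    """
    parts = "  ".join(f"{k}={v:.2f}" for k, v in components.items())
    total = sum(components.values())
    return f"{class_name}  acc={acc:.2f}  {parts}  total={total:.2f}"
```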