I wanted to understand something about CLIP features for Re-ID, the task of tracking someone for a while and then re-recognizing them later (for example, when they come into view of the next camera).
The wb-wob-reid-dataset has images of hundreds of people, each captured both wearing the same clothes (from different viewpoints) and wearing different clothes. In this post I'm not concerned with training CLIP, just using it as-is and trying to understand how well CLIP features already support the re-id task.
For the record, I used the openai/clip-vit-base-patch32 model and computed embedding features for all the both_small/bounding_box_test images. Something like 10 images per second can be embedded on my laptop's CPU, so all of this runs pretty quickly.
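Concretely, the embedding loop is just a few lines with the transformers library. Here's a minimal sketch; the directory path, batch size, and file extension are just my setup, so adjust them to your copy of the dataset:

```python
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image_paths = sorted(Path("both_small/bounding_box_test").glob("*.jpg"))

embeddings = []
with torch.no_grad():
    for start in range(0, len(image_paths), 32):
        batch = [Image.open(p).convert("RGB") for p in image_paths[start:start + 32]]
        inputs = processor(images=batch, return_tensors="pt")
        feats = model.get_image_features(**inputs)        # (batch, 512) for ViT-B/32
        feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize, as usual for CLIP
        embeddings.append(feats.cpu().numpy())

embeddings = np.concatenate(embeddings)  # (num_images, 512)
```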
First, in the vein of "visualize everything", here is a t-SNE plot of 2000 images of 250 people. Each person is a class, and all images of the same person (class) get the same color.
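The t-SNE is the stock scikit-learn one. A sketch, assuming the `embeddings` array from above and a `person_ids` array with one integer ID per image (how the IDs get parsed depends on the dataset's filename convention):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the 512-d CLIP features down to 2-d for plotting.
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)

plt.figure(figsize=(8, 8))
plt.scatter(xy[:, 0], xy[:, 1], c=person_ids, cmap="tab20", s=5)
plt.title("t-SNE of CLIP features, colored by person ID")
plt.show()
```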
That plot is hard to read. You can maybe convince yourself that there are clusters of the same color, but it's not clear, and we also aren't taking advantage of the extra label the dataset has: which images of a person show them in the same clothes. So I wrote code to iterate through the images and tag every image of a person by outfit. Then I can show all the images as gray, with the images from one person highlighted and color-coded by outfit.
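The highlighting plot is a small variation on the scatter plot above; a sketch, assuming an `outfit_ids` array holding the per-image outfit label alongside `person_ids` and the t-SNE layout `xy`:

```python
import numpy as np
import matplotlib.pyplot as plt

def highlight_person(xy, person_ids, outfit_ids, person):
    mask = person_ids == person
    plt.figure(figsize=(8, 8))
    # Everyone else in gray, as background context.
    plt.scatter(xy[~mask, 0], xy[~mask, 1], c="lightgray", s=5)
    # This person's images, one color per outfit.
    for outfit in np.unique(outfit_ids[mask]):
        sel = mask & (outfit_ids == outfit)
        plt.scatter(xy[sel, 0], xy[sel, 1], s=25, label=f"outfit {outfit}")
    plt.legend()
    plt.title(f"Person {person}")
    plt.show()

highlight_person(xy, person_ids, outfit_ids, person=64)
```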
Sometimes this shows clusters corresponding to a particular person, in all outfits, like the super unique "Person 64":
But more often you have multiple clusters per person:
and most commonly, the images of a person are pretty spread out across the space:
I always like to better understand my data, so let's look at these three people. First, the unique person 64 seems to have the same dress on, and the same kind of background, in each picture:
Person 65 also has pretty unique clothes, but they're sometimes hidden by the backpack:
And person 66 has some bright red flourishes (?) that were still not enough for the CLIP features to pull these images together.
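For reference, the way I eyeball the raw crops for one person is just a quick thumbnail grid. A sketch, assuming the person ID is the prefix before the first underscore in the filename (as in Market-1501-style names); adjust the parsing to your copy of the dataset:

```python
import matplotlib.pyplot as plt
from PIL import Image

def show_person(image_paths, person, cols=8):
    # Collect every crop whose filename prefix matches the person ID.
    paths = [p for p in image_paths if int(p.name.split("_")[0]) == person]
    rows = (len(paths) + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 3 * rows))
    for ax, p in zip(axes.flat, paths):
        ax.imshow(Image.open(p))
        ax.set_title(p.name, fontsize=6)
    for ax in axes.flat:
        ax.axis("off")
    plt.show()

show_person(image_paths, person=64)
```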
Now, the next step (coming soon) is to see if we can find good parts of the CLIP feature space that help automatically merge the features of the same person, so that one person's images aren't spread out across the whole space.