I spent a bit of time this last week working on the "are these similarity maps from the same class or a different class" classifier. As a first pass at getting this running, I took my pair of 8x8 heatmaps, scaled them up to 32x32, and concatenated them along the depth dimension to get a CNN input of shape 32x32x2, with a binary label for whether the pair is from the same class or not. I have a training dataset with ~300k pairs, 50% of which are from the same class and 50% from different classes, and a test dataset of ~150k pairs, also equally split.
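To make the input construction concrete, here is a minimal sketch of the upsample-and-stack step. It assumes NumPy and nearest-neighbor upsampling; the post does not say which interpolation was actually used, and the function name `make_pair_input` is made up for illustration.

```python
import numpy as np

def make_pair_input(heatmap_a, heatmap_b, scale=4):
    """Upsample two 8x8 heatmaps and stack them into one (32, 32, 2) array."""
    def upsample(h):
        # Assumption: nearest-neighbor upsampling, i.e. each cell is
        # repeated `scale` times along both spatial axes.
        return np.repeat(np.repeat(h, scale, axis=0), scale, axis=1)

    # Stack along a new last axis so each heatmap becomes one channel.
    return np.stack([upsample(heatmap_a), upsample(heatmap_b)], axis=-1)

# Random stand-ins for a real pair of similarity heatmaps.
rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = rng.random((8, 8))
x = make_pair_input(a, b)
print(x.shape)  # (32, 32, 2)
```

Channel 0 holds the first heatmap and channel 1 the second, so swapping the order of a pair produces a different input. Whether the network should see both orderings is a separate design choice.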
I then trained a network with cross-entropy loss and am getting roughly 75% training accuracy and 66% testing accuracy (better than random chance!). But I actually don't think this should work, for a couple of reasons. One: you can reasonably imagine the case where you get identical heatmaps with different labels (a pair of images from the same class that focus on the same regions as a pair of images from different classes). Two: actually looking at the images, I kind of don't believe that there are obvious differences to key on.
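The first concern can be checked directly: if the same input shows up with both labels, no classifier can get all of those copies right. The sketch below is not something I ran on the real data. It uses a hypothetical helper, `best_possible_accuracy`, and toy keys to show how exact collisions cap the achievable accuracy on a given dataset.

```python
from collections import defaultdict

def best_possible_accuracy(inputs, labels):
    """Accuracy ceiling for any deterministic classifier on this dataset.

    inputs: hashable keys, one per example (for a NumPy heatmap pair,
            something like arr.tobytes() would work).
    labels: 0 (different class) or 1 (same class), one per example.
    """
    # For each distinct input, count how often it appears with each label.
    counts = defaultdict(lambda: [0, 0])
    for key, label in zip(inputs, labels):
        counts[key][label] += 1

    # The best a classifier can do is predict the majority label per input.
    correct = sum(max(c) for c in counts.values())
    return correct / len(labels)

# Toy example: input "p" appears once with each label, so one copy is
# always misclassified.
inputs = ["p", "p", "q", "r"]
labels = [1, 0, 1, 0]
ceiling = best_possible_accuracy(inputs, labels)
print(ceiling)  # 0.75
```

This only catches exact duplicates. Near-identical heatmaps with conflicting labels would hurt in the same way but would need a distance threshold to detect.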
I always like to play the "can a human do this task" game, so for each of the below images, do you think that the images from the same class are on the left or the right? (Answers are below the images in white text)
same on left
same on left
same on right