About the high similarity on conv1 with Abby's Mask: my guess is that the average pooling makes the features nearly the same. For natural images, pixel values roughly share a common distribution, so the output of each single conv1 filter also follows a similar distribution across images. The global average of that output then ends up close to the expected value of the distribution, whatever the image is.
So I compared different downsampling scales for the conv1 output. The 16*16 result uses the upsampled mask. (The original output dimension of conv1 is 128*128*64.)
From the plots above, as the downsampling factor is reduced, the similarity peak gets lower and moves to the left, which fits the idea that the averaging is what drives the high similarity.
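A rough way to sanity-check this reasoning is a small synthetic experiment, sketched below. It does not use the real conv1 features or the mask: `fake_conv1_output` is a hypothetical stand-in that draws independent, non-negative activations (as if after a ReLU, which is an assumption) from the same per-channel distribution for two unrelated "images". The helper names `avg_pool` and `cosine`, and the choice of cosine similarity as the metric, are also assumptions made for illustration.

```python
import numpy as np

def avg_pool(feat, out_size):
    """Average-pool an (H, W, C) feature map down to (out_size, out_size, C)."""
    h, w, c = feat.shape
    assert h % out_size == 0 and w % out_size == 0
    bh, bw = h // out_size, w // out_size
    # Split each spatial axis into (blocks, block_size) and average over the block axes.
    return feat.reshape(out_size, bh, out_size, bw, c).mean(axis=(1, 3))

def cosine(a, b):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Assumption: each of the 64 filters has its own fixed output distribution,
# shared by all images (here an exponential with a per-channel scale).
channel_scale = rng.uniform(0.5, 2.0, size=64)

def fake_conv1_output():
    """Hypothetical stand-in for a 128*128*64 conv1 output of one image."""
    return rng.exponential(scale=channel_scale, size=(128, 128, 64))

img_a = fake_conv1_output()
img_b = fake_conv1_output()

# out_size=1 is global average pooling; out_size=128 is no downsampling at all.
sims = {}
for out_size in (1, 16, 64, 128):
    sims[out_size] = cosine(avg_pool(img_a, out_size), avg_pool(img_b, out_size))
    print(f"pooled to {out_size}*{out_size}: cosine similarity = {sims[out_size]:.3f}")
```

In this toy setup the two inputs have nothing in common except their per-channel statistics, so any similarity that shows up after heavy pooling comes from the averaging alone. Real conv1 outputs are spatially correlated, so the exact numbers would differ.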