
1

About the Data Set:

I have been working on classification of a Kaggle plant seedling dataset with 12 classes. Here are some manually picked examples from each class:

Black Grass:

Charlock:

Cleavers:

Common Chickweed:

Common Wheat:

Fat Hen:

Loose Silky Bent:

Maize:

Scentless Mayweed:

Shepherd's purse:

Small Flowered Cranesbill:

Sugar Beet:

A ResNet18 pre-trained on ImageNet has been fine-tuned on this dataset, achieving about 99% prediction accuracy.
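For reference, here is a minimal sketch of this fine-tuning setup (the dataset path, batch size, learning rate, and epoch count are illustrative, not the exact settings behind the 99% number):

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing; the folder layout is a hypothetical
# ImageFolder arrangement of the 12 seedling classes.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("plant-seedlings/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(pretrained=True)           # ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 12)     # 12 seedling classes

opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for images, labels in loader:
        opt.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        opt.step()
```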

Deep dream x ResNet18:

This week, I used deep dream to visualize each layer of the network. Here are some results I found interesting:

Original image, and the results of maximizing the 'add' layer following stages 2, 3, and 4:

Spirals (maize seedlings) and grey vertical lines (bar codes?) are encoded in stage 2; star-like shapes (intersections of thin leaves) and green color are encoded in stage 3; lines, curves, and angles are encoded in stage 4.

Compared with the result from the mixed4c layer of Inception V3:

No higher-level structure related to plants emerged in any layer, no matter how I changed the parameters. (Probably because of the monotony of the dataset, no high-level structure is necessary to classify it?)
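As a rough sketch of what "maximizing a layer" means here: gradient ascent on the input image to increase the mean activation of a chosen stage. This is a simplified version without the usual octave/jitter tricks, and the layer choice (torchvision's layer3 as a stand-in for "stage 3"), step size, and iteration count are illustrative:

```python
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
activations = {}

def hook(module, inp, out):
    # Keep the output of the chosen stage so we can maximize it.
    activations["target"] = out

model.layer3.register_forward_hook(hook)

img = torch.rand(1, 3, 224, 224, requires_grad=True)   # or start from a real image
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    model(img)
    loss = -activations["target"].mean()   # ascend on the mean activation
    loss.backward()
    optimizer.step()
    img.data.clamp_(0, 1)                  # keep a valid image range
```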

Input, 2 conv layers, output of Stage2 unit2:

 

Input, 2 conv layers, output of Stage3 unit2:

The output becomes a "weighted mixture" of the main branch and the shortcut. (Can this explain the high performance of ResNet?)
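As a reminder of why the two signals mix, here is the structure of a torchvision-style basic residual block: the main (conv) branch and the shortcut are simply added before the final ReLU, so the unit output is a sum of the two paths visualized above.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Simplified ResNet basic block with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        mainstream = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        shortcut = x                                  # identity skip connection
        return self.relu(mainstream + shortcut)       # the "mixture" of the two paths
```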

Class Activation Map:

I also tried class activation maps on some randomly picked data samples. Most of the samples have the expected heat map, like this:

Several pictures have rather unexpected activation maps:

For the sample on the left, the flower-like leaves should be a very good indication of the Cleavers class, but the network only looks at the cotyledons. For the samples on the right, the network ignores the leaves in the center.
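A minimal sketch of the CAM computation, assuming a torchvision-style ResNet18 (the pretrained model and random input here are stand-ins for the fine-tuned network and a real seedling image):

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True).eval()

features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(feat=o))

img = torch.rand(1, 3, 224, 224)            # placeholder input image
with torch.no_grad():
    logits = model(img)
cls = logits.argmax(dim=1).item()           # predicted class

feat = features["feat"][0]                  # (512, 7, 7) last conv features
weights = model.fc.weight[cls]              # (512,) FC weights for that class
cam = torch.einsum("c,chw->hw", weights, feat)   # weighted sum of feature maps
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
cam = F.interpolate(cam[None, None], size=(224, 224),
                    mode="bilinear", align_corners=False)[0, 0]   # overlay-ready heat map
```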

 

Given two images that are similar, run PCA on their concatenated last conv layer activations.

We can then visualize the top ways in which that data varied. In the image below, the top-left image is the 1st component, and in "reading" order the importance decreases (the bottom right is the 100th component).

[no commentary for now because I ran out of time and suck.]
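A hedged sketch of my reading of this setup: treat each spatial position of the last conv layer as a sample, stack the positions from both images, run PCA over the channel dimension, and reshape each component's projection back into a spatial map (the input images, resolution, and layer choice are placeholders):

```python
import numpy as np
import torch
from torchvision import models
from sklearn.decomposition import PCA

model = models.resnet18(pretrained=True).eval()
feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(out=o))

def conv_features(img):                        # img: (1, 3, H, W) tensor
    with torch.no_grad():
        model(img)
    f = feats["out"][0]                        # (512, h, w) last conv features
    c, h, w = f.shape
    return f.reshape(c, h * w).T.numpy(), (h, w)   # (h*w, 512) spatial samples

img_a = torch.rand(1, 3, 448, 448)             # stand-ins for the two similar images
img_b = torch.rand(1, 3, 448, 448)
feats_a, (h, w) = conv_features(img_a)
feats_b, _ = conv_features(img_b)

X = np.concatenate([feats_a, feats_b], axis=0)     # (2*h*w, 512)
pca = PCA(n_components=100)
proj = pca.fit_transform(X)                    # projection onto each component

# Spatial map of the k-th component for image A (visualize with imshow):
k = 0
component_map_a = proj[:h * w, k].reshape(h, w)
```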

Sometimes I like being a contrarian.  This paper (https://arxiv.org/pdf/1904.13132.pdf) suggests that you can train the low levels of a deep learning network with just one image (and a whole mess of data augmentation approaches, like cropping, rotating, etc.).  This contradicts a widely held belief in the field that the reason to pre-train on ImageNet is that having a large number of images makes for a really good set of low-level features.

I'm curious what other assumptions we can attack, and how?

One approach to data augmentation is to take your labelled data and make *more* labelled data by flipping the images left and right and/or cropping them, and giving the new images the same label.  Why are these common data augmentation tools?  Because flipping an image left-right (reflecting it), or taking a slight crop, usually results in an image that you'd expect to have the same label.
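In torchvision terms, these label-preserving augmentations are just a couple of standard transforms (the parameters below are illustrative):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # reflect left/right
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # slight crops
    transforms.ToTensor(),
])
# Each augmented image keeps the label of the original it was derived from.
```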

So let's flip that assumption around.  Imagine training a binary image classifier with many images that are labelled as either original or flipped.  Can a deep learning network learn to tell if something has been flipped left/right?  And if it can, what has it learned?  Here is an in-post test.  For these three images (the first three images that I saw when I looked at Facebook today), either the top or the bottom has been flipped from the original.  Can you say which is the original in each case?

[(Top, bottom, bottom)]

Answers available by highlighting above.

What cues are available to figure this out?  What did you use?  Could a network learn this?   Would it be interesting to make such a network and ask what features in the image it used to come to its conclusion?

What about the equivalent version that considers image crops?  (Binary classifier: is this a cropped "normal picture" or not?  Non-binary classifier: is this cropped from the top-left corner of the normal picture?  The top-right corner?  The middle?)

What are other image transformations that we usually ignore?
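Here is a hedged sketch of the flipped-or-not experiment proposed above: the labels come for free by randomly flipping each image, and a binary classifier is trained to recover whether the flip happened (the image folder and hyperparameters are placeholders):

```python
import random
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torchvision import datasets, models
import torchvision.transforms.functional as TF

class FlipDataset(Dataset):
    """Wraps any image dataset; label 1 means the image was flipped."""
    def __init__(self, base):
        self.base = base
    def __len__(self):
        return len(self.base)
    def __getitem__(self, idx):
        img, _ = self.base[idx]                 # discard the original label
        flipped = random.random() < 0.5
        if flipped:
            img = TF.hflip(img)
        return TF.to_tensor(TF.resize(img, (224, 224))), int(flipped)

base = datasets.ImageFolder("some_images/")     # hypothetical image folder
data = FlipDataset(base)
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)   # flipped vs. original
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for images, labels in loader:                   # one pass; loop epochs as needed
    opt.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    opt.step()
```

Whatever features such a network ends up using (text, faces, lighting direction, watermarks?) is exactly the question the quiz above is probing.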

 

2

Continuing the idea of the triplet scatter plots: I plot the scatter after each training epoch, for both the training and the testing set, to visualize how the triplets move during training. I put the animation in this slide deck.

These scatter plots come from a ResNet50 trained on the Stanford Online Products dataset.

https://docs.google.com/presentation/d/19l5ds8s0oBbbKWifIYZFWjIeGqeG4yw4CVnQ0A5Djnc/edit?usp=sharing

The difference between 1st-order and 2nd-order EPSHN shows up in the top-right corner and along the right boundary after 40 epochs of training. There are more dots in that area for 1st-order EPSHN than for 2nd-order EPSHN.

But this visualization still doesn't show how these two methods differ in their effect on the triplets.

So I pick a test image, find its closest positive and closest negative after training, and check the corresponding triplet. I draw its moving path during training.
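For clarity, here is how I read the scatter coordinates: each test embedding is plotted at (similarity to its closest positive, similarity to its closest negative), which is why (1, 0) is the ideal corner. A small sketch (variable names are illustrative):

```python
import numpy as np

def triplet_scatter(embeddings, labels):
    """embeddings: (N, D) L2-normalized vectors; labels: (N,) class ids."""
    sim = embeddings @ embeddings.T                  # cosine similarities
    np.fill_diagonal(sim, -np.inf)                   # exclude self-matches
    same = labels[:, None] == labels[None, :]
    pos_sim = np.where(same, sim, -np.inf).max(axis=1)    # closest positive
    neg_sim = np.where(~same, sim, -np.inf).max(axis=1)   # closest negative
    return pos_sim, neg_sim

# pos, neg = triplet_scatter(emb, y); plt.scatter(pos, neg) after each epoch
```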

 

The blue dot is the original ImageNet similarity relation, the green dot is the final position under 1st-order EPSHN, and the red dot is the final position under 2nd-order EPSHN.

After checking several test images, I find that 2nd-order EPSHN moves the triplet dot closer to the ideal position, (1, 0).

 

Then I compute statistics of the distance from each final dot to the point (1, 0) for both methods. I also draw the histogram of the triplets with the ImageNet initialization (pretrain). The following plot shows the result.
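A short sketch of that statistic: the distance of each (pos, neg) dot to the ideal corner (1, 0), collected into a histogram per method. The arrays below are random stand-ins just to keep the snippet runnable; the real ones come from the scatter computation above.

```python
import numpy as np
import matplotlib.pyplot as plt

def distance_to_ideal(pos_sim, neg_sim):
    # Euclidean distance of each (pos, neg) scatter point to (1, 0).
    return np.sqrt((pos_sim - 1.0) ** 2 + neg_sim ** 2)

for name in ["ImageNet pretrain", "1st-order EPSHN", "2nd-order EPSHN"]:
    pos, neg = np.random.rand(1000), np.random.rand(1000)   # placeholder arrays
    plt.hist(distance_to_ideal(pos, neg), bins=50, alpha=0.5, label=name)
plt.xlabel("distance to (1, 0)")
plt.legend()
plt.show()
```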

About the high similarity on conv1 with Abby's mask: my thought is that the average pooling makes them the same. For natural images, the pixel values share roughly the same distribution, so the outputs of each single filter in conv1 also share a common distribution. The global average of the output is then close to the expected value of that distribution, regardless of the mask.

So I compared different scales of downsampling of the conv1 output. The 16×16 result uses the upsampled mask. (The original output dimension of conv1 is 128×128×64.)

In the plots above, as the downsampling scale is reduced, the peak of the similarity distribution gets lower and moves left.
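A hedged sketch of this comparison: take the conv1 outputs of two inputs (an image and a masked version of it, both placeholders here), average-pool them to different spatial sizes, and compare the cosine similarity of the pooled features:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True).eval()

def conv1_output(x):
    # Output of the first conv block: (1, 64, 128, 128) for a 256x256 input.
    with torch.no_grad():
        return model.relu(model.bn1(model.conv1(x)))

img = torch.rand(1, 3, 256, 256)                          # placeholder image
masked = img * (torch.rand(1, 1, 256, 256) > 0.5).float() # stand-in for the mask

for size in [1, 4, 16, 64, 128]:                          # pooled output sizes
    a = F.adaptive_avg_pool2d(conv1_output(img), size).flatten()
    b = F.adaptive_avg_pool2d(conv1_output(masked), size).flatten()
    sim = F.cosine_similarity(a, b, dim=0)
    print(f"{size}x{size}: cosine similarity = {sim.item():.3f}")
```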

https://www.youtube.com/watch?v=5ResNQwydQg

 

The internet is a strange place.  I've talked about using reaction videos as a source of free labels (learn a deep network to map faces to an embedding space where images from the same time in an aligned video are mapped to the same place).  Why is that good?  There is lots of work on emotion recognition, but it is largely limited to "Happy" or "Sad" or "Angry", and often recognition only works with pretty extreme facial expressions.  Real expressions are more subtle, shaded, and interesting, but usually nobody uses them because there are no labels.  We don't have strong labels, but we do have weak labels (these images should have the same label).

And, lucky us!  Someone has made a reaction video montage, like the one above, aligning all the videos already!  (crazy!).

Not just one, here is another:

https://www.youtube.com/watch?v=u_jgBySia0Y

and, not just 2, but literally hundreds of them:

https://www.youtube.com/channel/UC7uz-e_b68yIVocAKGdN9_A/videos
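A hedged sketch of the weak-label training this suggests: faces cropped from different aligned videos at the same timestamp are positives, faces from other timestamps are negatives, and a triplet loss pulls the same-time faces together (the tensors below are placeholders; there is no real frame sampling here):

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(pretrained=True)
backbone.fc = nn.Linear(backbone.fc.in_features, 128)   # 128-d embedding
triplet = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def embed(x):
    return nn.functional.normalize(backbone(x), dim=1)  # unit-length embeddings

# anchor & positive: faces from two videos at the same timestamp;
# negative: a face from a different timestamp (placeholder tensors).
anchor = torch.rand(16, 3, 224, 224)
positive = torch.rand(16, 3, 224, 224)
negative = torch.rand(16, 3, 224, 224)

loss = triplet(embed(anchor), embed(positive), embed(negative))
opt.zero_grad()
loss.backward()
opt.step()
```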

 

 

In order to figure out how yoked t-SNE works and to dig deeper into it, in this experiment I run two t-SNEs from two different initializations and yoke them together with different values of lambda.

This experiment runs on the training set of the CAR196 dataset.

This result is very similar to my previous experiment, which yoked different embedded data together (Npair vs. Proxy).
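For reference, here is a hedged sketch of how I think of the yoking: two embeddings are optimized against the same affinities with the usual t-SNE KL objective, plus a lambda-weighted penalty on the distance between corresponding points. This is simplified (fixed-bandwidth affinities instead of perplexity matching, and plain Adam instead of the usual t-SNE optimizer):

```python
import torch

def high_dim_affinities(X, sigma=1.0):
    # Simplified symmetric Gaussian affinities, normalized to sum to 1.
    d2 = torch.cdist(X, X) ** 2
    P = torch.exp(-d2 / (2 * sigma ** 2))
    P.fill_diagonal_(0)
    return (P + P.T) / (2 * P.sum())

def low_dim_affinities(Y):
    # Student-t kernel used by t-SNE in the embedding space.
    d2 = torch.cdist(Y, Y) ** 2
    num = 1.0 / (1.0 + d2)
    num = num * (1 - torch.eye(Y.shape[0]))   # zero the diagonal
    return num / num.sum()

def kl(P, Q, eps=1e-12):
    return (P * (torch.log(P + eps) - torch.log(Q + eps))).sum()

def yoked_tsne(X, lam=0.1, steps=1000, lr=0.1):
    P = high_dim_affinities(X)
    Y1 = torch.randn(X.shape[0], 2, requires_grad=True)   # initialization 1
    Y2 = torch.randn(X.shape[0], 2, requires_grad=True)   # initialization 2
    opt = torch.optim.Adam([Y1, Y2], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = kl(P, low_dim_affinities(Y1)) + kl(P, low_dim_affinities(Y2))
        loss = loss + lam * ((Y1 - Y2) ** 2).sum(dim=1).mean()   # yoking penalty
        loss.backward()
        opt.step()
    return Y1.detach(), Y2.detach()

# Y1, Y2 = yoked_tsne(torch.randn(500, 64), lam=0.1)
```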

I realized the embedding result in the ICCV submission is a little weird. For the CAR dataset, Npair loss R@1 accuracy on the testing set is 53% and EPSHN is 72%. I feel this might be related to the initial learning rate, since in all the tests in the paper I set the learning rate to 0.0005.

I ran tests on the two approaches, EP (easy positive) and EPSHN (easy positive with semi-hard negative), with initial learning rates increasing from 0.0001 to 0.0002, 0.0004, 0.0008, and 0.0016.

And I get this result.

It looks like EPSHN can tolerate a large learning rate and reach a better result, but EP cannot handle the large learning rate.