Stanford's CS231n course, which covers some of the most important aspects of using CNNs for Visual Recognition, provides an extensive amount of information on the subject. Although I'm only slowly working through the course, here are some interesting things I've learned so far (they are probably very basic for you, but I'm quite ~thrilled~ to be learning all of this):
- SVM and Softmax are very comparable classifiers, and both can perform quite well on a variety of classification tasks. However, Softmax is easier to work with because it outputs a probability for each class in a given image, rather than the arbitrary scores an SVM produces. Those arbitrary scores still allow for quite simple evaluation, but probabilities are preferred in most cases.
- That said, even a small change from an SVM/Softmax classifier to a simple 2-layer neural network can improve results dramatically.
- The loss function essentially quantifies our "unhappiness" with the final results of the classification. Triplet Loss is one example of such a function, but the course (or at least its beginning) focuses on hinge loss.
- Hinge Loss is mostly used for SVMs.
- Hinge Loss penalises predictions via max(0, s_j - s_correct + Δ): an incorrect class only contributes to the loss when its score comes within Δ of the correct class's score, so Δ acts as a margin.
- Sometimes the squared Hinge Loss, max(0, s_j - s_correct + Δ)^2, is used, which penalises margin violations more strongly (quadratically rather than linearly).
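
To make the comparison concrete, here is a minimal sketch of both losses for a single example -- the scores, the correct class index, and the margin below are made up purely for illustration:

```python
import numpy as np

# Minimal sketch: multiclass SVM (hinge) loss vs. softmax cross-entropy loss
# for a single example. The numbers below are made up for illustration.
scores = np.array([3.2, 5.1, -1.7])  # unnormalised class scores
correct_class = 0                    # index of the true class
delta = 1.0                          # the margin Δ in the hinge loss

# Hinge loss: sum of max(0, s_j - s_correct + delta) over the incorrect classes
margins = np.maximum(0, scores - scores[correct_class] + delta)
margins[correct_class] = 0
svm_loss = margins.sum()

# Softmax cross-entropy loss: -log of the normalised probability of the true class
shifted = scores - scores.max()               # shift scores for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()
softmax_loss = -np.log(probs[correct_class])

print(f"SVM hinge loss: {svm_loss:.3f}")
print(f"Softmax cross-entropy loss: {softmax_loss:.3f}")
```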
A part of my learning came from studying and running the "toy example" explained in the course. Here are the steps I took to work through it (the course had some sample code, but I adapted it for myself and changed things around a little):
- Generate a sample "swirl" (spiral) dataset that looks like this and consists of 3 classes -- red, blue, and yellow:
- Train a Softmax classifier (see the first code sketch after this list).
- 300 2-D points (so 300 rows of scores, with 3 scores each -- one per colour/class)
- Calculate cross-entropy loss
- Calculate & evaluate the accuracy of the training set, which turns out to only be 53%
- Train a 2-layer neural network (see the second code sketch after this list)
- Two sets of weights and biases (for the first and second layers)
- The size of the hidden layer (H) is 100
- The only change from before is one extra line of code: we first compute the hidden-layer representation and then the scores based on that hidden layer.
- The forward pass to compute scores
- 2 layers in the NN
- Use the ReLU activation function -- we've added a non-linearity with ReLU that thresholds the activations of the hidden layer at zero
- Backpropagate through all the layers
- Calculate & evaluate the accuracy of the training set, which turns out to be 98% yay!
- Results
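
Here is the first code sketch: a condensed version of roughly how the spiral data generation and the Softmax classifier training went. It follows the spirit of the course's toy example, but the exact hyperparameters here (step size, regularisation strength, number of iterations) are illustrative rather than a faithful copy of my notebook:

```python
import numpy as np

# A condensed sketch of the toy example: generate the 3-class spiral data,
# then train a linear Softmax classifier on it with gradient descent.
np.random.seed(0)
N, D, K = 100, 2, 3              # points per class, dimensionality, number of classes
X = np.zeros((N * K, D))         # the 300 2-D points
y = np.zeros(N * K, dtype=int)   # class labels (0, 1, 2 -- the three colours)
for k in range(K):
    ix = range(N * k, N * (k + 1))
    r = np.linspace(0.0, 1.0, N)                                        # radius
    t = np.linspace(k * 4, (k + 1) * 4, N) + np.random.randn(N) * 0.2   # angle
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = k

# One weight matrix and one bias vector -- a purely linear classifier
W = 0.01 * np.random.randn(D, K)
b = np.zeros((1, K))
step_size, reg = 1.0, 1e-3

for i in range(200):
    scores = X.dot(W) + b                                   # 300 rows of 3 scores
    # Softmax probabilities and the average cross-entropy loss
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[range(N * K), y]).mean()
    loss = data_loss + 0.5 * reg * np.sum(W * W)
    if i % 50 == 0:
        print(f"iteration {i}: loss {loss:.3f}")

    # Gradient on the scores, then backprop into W and b
    dscores = probs
    dscores[range(N * K), y] -= 1
    dscores /= N * K
    dW = X.T.dot(dscores) + reg * W
    db = dscores.sum(axis=0, keepdims=True)
    W -= step_size * dW
    b -= step_size * db

# Training accuracy of the linear classifier -- only around the 50% mark
predicted = np.argmax(X.dot(W) + b, axis=1)
print("training accuracy:", np.mean(predicted == y))
```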
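
And here is the second code sketch: the 2-layer network. It reuses X and y from the sketch above, and again the hyperparameters are illustrative; the point is that the only structural change is computing the hidden-layer representation (with a ReLU) before the scores.

```python
import numpy as np

# A sketch of the 2-layer network. It reuses X and y from the sketch above.
np.random.seed(1)
D, H, K = 2, 100, 3                          # input dim, hidden size (H = 100), classes
W1 = 0.01 * np.random.randn(D, H)            # first set of weights and biases
b1 = np.zeros((1, H))
W2 = 0.01 * np.random.randn(H, K)            # second set of weights and biases
b2 = np.zeros((1, K))
step_size, reg = 1.0, 1e-3
num_examples = X.shape[0]

for i in range(10000):
    # Forward pass: hidden-layer representation with a ReLU, then the scores
    hidden = np.maximum(0, X.dot(W1) + b1)   # ReLU thresholds activations at zero
    scores = hidden.dot(W2) + b2

    # Softmax cross-entropy loss with L2 regularisation on both weight matrices
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    loss = (-np.log(probs[range(num_examples), y]).mean()
            + 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2)))
    if i % 1000 == 0:
        print(f"iteration {i}: loss {loss:.3f}")

    # Backpropagate through both layers
    dscores = probs
    dscores[range(num_examples), y] -= 1
    dscores /= num_examples
    dW2 = hidden.T.dot(dscores) + reg * W2
    db2 = dscores.sum(axis=0, keepdims=True)
    dhidden = dscores.dot(W2.T)
    dhidden[hidden <= 0] = 0                 # gradient of the ReLU
    dW1 = X.T.dot(dhidden) + reg * W1
    db1 = dhidden.sum(axis=0, keepdims=True)

    W1 -= step_size * dW1
    b1 -= step_size * db1
    W2 -= step_size * dW2
    b2 -= step_size * db2

# Training accuracy of the 2-layer network
hidden = np.maximum(0, X.dot(W1) + b1)
predicted = np.argmax(hidden.dot(W2) + b2, axis=1)
print("training accuracy:", np.mean(predicted == y))
```

In my notebook, this one extra hidden layer is what took the training accuracy from 53% to 98% on the toy data.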
Please find my IPython notebook with all the notes and results on GitHub.
I love this last picture. It highlights
(a) that a two-layer network builds its decision regions out of a collection of linear boundaries (a network of any depth does this, but you can still see it clearly with two layers), and
(b) that two layers are enough to make pretty interesting shapes, even in 2-D!