1. Nearest Neighbor Classifier
NN (Nearest Neighbor Classifier):
This classifier has nothing to do with Convolutional Neural Networks and it is very rarely used in practice, but it will allow us to get an idea about the basic approach to an image classification problem.
One of the simplest possibilities is to compare the images pixel by pixel and add up all the differences. In other words, given two images represented as vectors I1, I2, a reasonable choice for comparing them might be the L1 distance:
$$d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$$
where the sum is taken over all pixels p.
There are many other ways of computing distances between vectors. Another common choice could be to instead use the L2 distance, which has the geometric interpretation of computing the Euclidean distance between two vectors. The distance takes the form:
$$d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$$
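A minimal sketch of the whole nearest-neighbor procedure in numpy (the class and array names are placeholders, not a fixed API; the L2 alternative is shown as a comment):

```python
import numpy as np

class NearestNeighbor:
    """Minimal nearest-neighbor classifier sketch using the L1 distance."""

    def train(self, X, y):
        # The NN classifier simply memorizes all of the training data.
        self.Xtr = X   # X is N x D, each row is a flattened image
        self.ytr = y   # y is a 1-D array of N labels

    def predict(self, X):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 distance: sum of absolute pixel-wise differences to every training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # L2 would instead be: np.sqrt(np.sum((self.Xtr - X[i, :])**2, axis=1))
            Ypred[i] = self.ytr[np.argmin(distances)]   # label of the closest training image
        return Ypred
```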
kNN (k-Nearest Neighbor Classifier):
Instead of finding the single closest image in the training set, we will find the top k closest images, and have them vote on the label of the test image.
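A minimal sketch of the voting step, assuming training images Xtr (one flattened image per row), labels ytr and a single test image x (all placeholder names):

```python
import numpy as np

def knn_predict_one(Xtr, ytr, x, k=5):
    """Predict the label of one test image by majority vote among its
    k nearest training images under the L1 distance (sketch only)."""
    distances = np.sum(np.abs(Xtr - x), axis=1)          # L1 distance to every training image
    nearest = np.argsort(distances)[:k]                  # indices of the k closest images
    votes = ytr[nearest]                                 # their labels
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                     # most common label wins
```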
2. Linear Classification
The simplest possible function, a linear mapping:
$$f(x_i, W, b) = W x_i + b$$
In the above equation, we are assuming that the image xi has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]), and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, xi contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in W are often called the weights, and b is called the bias vector because it influences the output scores, but without interacting with the actual data xi.
The single matrix multiplication Wxi is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of W.
Assume the image only has 4 pixels, and that we have 3 classes (red (cat), green (dog), blue (ship) class). Then we stretch the image pixels into a column and perform matrix multiplication to get the scores for each class.
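A small numeric sketch of that example; the pixel values, weights and biases below are arbitrary illustrative numbers, not values from any dataset:

```python
import numpy as np

x = np.array([56, 231, 24, 2])            # 4 pixel values stretched into a column
W = np.array([[0.2, -0.5, 0.1,  2.0],     # one row of weights per class (cat)
              [1.5,  1.3, 2.1,  0.0],     # (dog)
              [0.0,  0.25, 0.2, -0.3]])   # (ship)
b = np.array([1.1, 3.2, -1.2])            # one bias per class

scores = W.dot(x) + b                     # f(x, W, b) = Wx + b -> 3 class scores
print(scores)                             # [cat score, dog score, ship score]
```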
3. Loss Function
SVM (Multiclass Support Vector Machine):
The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin Δ. The Multiclass SVM loss for the i-th example is then formalized as follows:
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$
where s_j is the score of class j and y_i is the index of the correct class.
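For concreteness, a minimal (unvectorized) sketch of this loss for a single example, given its score vector; the function name is illustrative:

```python
import numpy as np

def svm_loss_i(scores, y_i, delta=1.0):
    """Multiclass SVM loss for one example (sketch).
    scores: 1-D array of class scores, y_i: index of the correct class."""
    margins = np.maximum(0, scores - scores[y_i] + delta)  # hinge term for every class
    margins[y_i] = 0                                        # the correct class does not contribute
    return np.sum(margins)                                  # L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + delta)
```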
Softmax classifier:
The Softmax classifier gives a slightly more intuitive output (normalized class probabilities). It uses a cross-entropy loss that has the form:
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$
The function inside the log, $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$, is called the softmax function.
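A minimal sketch of this loss for one example; subtracting the maximum score before exponentiating does not change the result but avoids numeric overflow:

```python
import numpy as np

def softmax_loss_i(scores, y_i):
    """Cross-entropy loss for one example (sketch).
    scores: 1-D array of class scores, y_i: index of the correct class."""
    shifted = scores - np.max(scores)                   # for numeric stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))   # softmax: normalized class probabilities
    return -np.log(probs[y_i])                          # L_i = -log P(correct class)
```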
SVM vs Softmax:
The SVM only cares that the score of the correct class is higher than the scores of the other classes by the margin; once that is satisfied the loss is zero, so many different W can be equally good to it. The Softmax classifier, in contrast, is never fully satisfied: it keeps pushing the probability of the correct class closer and closer to 1.
4. SGD
The loss function lets us quantify the quality of any particular set of weights W. The goal of optimization is to find W that minimizes the loss function.
Strategy #1: Random search
Try out many random weight matrices W and keep track of the one that gives the lowest loss. A very bad idea.
Strategy #2: Random Local Search
Try to extend one foot in a random direction and then take a step only if it leads downhill.
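A rough sketch of what that might look like for the linear classifier's weights; loss_fn, X, y and the shapes are placeholders, and the bias is omitted for simplicity:

```python
import numpy as np

def random_local_search(loss_fn, X, y, num_iters=1000, step_size=1e-4):
    # Start from a small random W (10 classes x 3072 pixels for CIFAR-10).
    W = np.random.randn(10, 3072) * 0.001
    best_loss = loss_fn(X, y, W)
    for _ in range(num_iters):
        W_try = W + np.random.randn(10, 3072) * step_size   # a small random perturbation
        loss_try = loss_fn(X, y, W_try)
        if loss_try < best_loss:          # keep the step only if it goes downhill
            W, best_loss = W_try, loss_try
    return W, best_loss
```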
Strategy #3: Following the Gradient
It turns out that there is no need to randomly search for a good direction: we can compute the best direction along which we should change our weight vector, which is mathematically guaranteed to be the direction of steepest descent (at least in the limit as the step size goes towards zero). This direction will be related to the gradient of the loss function.
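One way to get at this gradient is numerically, with a centered-difference approximation; it is slow (two loss evaluations per parameter) but easy to write and useful as a sanity check against an analytic gradient. A rough sketch, where f is any function of a numpy array W (names are illustrative):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Approximate the gradient of f at W with centered differences (sketch)."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = W[ix]
        W[ix] = old + h
        fxph = f(W)                       # f evaluated at W + h along this coordinate
        W[ix] = old - h
        fxmh = f(W)                       # f evaluated at W - h
        W[ix] = old                       # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad
```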
Gradient Descent
Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called Gradient Descent. Its vanilla version (vanilla meaning the standard, unmodified form of the algorithm, without any bells or whistles) looks as follows:
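A minimal sketch of that loop; loss_fun, evaluate_gradient, data, weights and step_size are placeholder names, not from any particular library:

```python
def gradient_descent(loss_fun, evaluate_gradient, data, weights, step_size, num_iters):
    # Vanilla gradient descent: repeatedly evaluate the gradient of the loss
    # with respect to the weights, then step in the opposite direction.
    for _ in range(num_iters):
        weights_grad = evaluate_gradient(loss_fun, data, weights)
        weights += -step_size * weights_grad   # parameter update
    return weights
```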
This simple loop is at the core of all Neural Network libraries. There are other ways of performing the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and established way of optimizing Neural Network loss functions.
Mini-batch gradient descent. In large-scale applications (such as the ILSVRC challenge), the training data can have on the order of millions of examples. Hence, it seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update. A very common approach to addressing this challenge is to compute the gradient over batches of the training data. For example, in current state-of-the-art ConvNets, a typical batch contains 256 examples from the entire training set of 1.2 million.
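A sketch of that loop, assuming data is a numpy array with one training example per row and that evaluate_gradient computes the gradient over whatever batch it is handed (all names are placeholders):

```python
import numpy as np

def minibatch_gradient_descent(loss_fun, evaluate_gradient, data, weights,
                               step_size=1e-3, batch_size=256, num_iters=1000):
    num_train = len(data)
    for _ in range(num_iters):
        # Draw a random mini-batch instead of using the full training set.
        idx = np.random.choice(num_train, batch_size, replace=False)
        data_batch = data[idx]
        weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
        weights += -step_size * weights_grad   # parameter update from the mini-batch gradient
    return weights
```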
The extreme case of this is a setting where the mini-batch contains only a single example. This process is called Stochastic Gradient Descent (SGD) (or sometimes on-line gradient descent). It is relatively less common in practice because, due to vectorized code optimizations, it can be computationally much more efficient to evaluate the gradient for 100 examples at once than to evaluate the gradient for one example 100 times.