
CS231n — Dropout

During the past week or so I've been studying/reading/trying to implement dropout in a Convolutional Neural Network.

Dropout is a quite simple yet very powerful regularization technique. It was first officially introduced in this paper, and it essentially involves keeping a neuron active with some probability p, or setting its output to zero otherwise. In other words, on each training pass, nodes are either dropped out of the net with probability 1-p or kept with probability p, so every forward pass effectively samples a thinned sub-network, and only the parameters of that sampled network are updated for the given input. Note that it is the activations of hidden (and sometimes input) units that get zeroed out, not examples from the training dataset. Dropout can be used for both text and image models, but it is only ever applied during training, never on the test dataset: at test time the full network is used (with the common "inverted dropout" formulation, activations are scaled by 1/p during training so that nothing has to change at test time). Its main end-goal is to reduce overfitting and therefore achieve greater accuracy on unseen data.
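To make the train/test asymmetry concrete, here is a minimal NumPy sketch of inverted dropout, roughly in the style of the CS231n layer functions. The function names, the `mode` flag, and the convention that `p` is the probability of *keeping* a unit are my assumptions here, not code from the assignment.

```python
import numpy as np

def dropout_forward(x, p=0.75, mode='train'):
    """Inverted dropout forward pass.

    Assumes p is the probability of keeping a unit active
    and mode is either 'train' or 'test'.
    """
    if mode == 'train':
        # Keep each activation with probability p, and rescale by 1/p
        # so the expected value of the output matches test time.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    else:
        # At test time dropout is a no-op: the full network is used.
        mask = None
        out = x
    return out, mask

def dropout_backward(dout, mask, mode='train'):
    """Backward pass: gradients only flow through the kept units."""
    if mode == 'train':
        return dout * mask
    return dout
```

Scaling the mask by 1/p during training keeps the expected activation unchanged, which is exactly why the test-time path can pass x through untouched.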

The paper linked above has a great visualisation of how dropout operates (see below). The image on the left shows a standard neural network with 2 hidden layers, and the image on the right shows the same network with certain (random) neurons dropped.

[Figure from the paper: a standard network with 2 hidden layers (left) and the same network after dropout is applied (right)]

Overall, as we can see from the plot below, when training a neural network on the CIFAR-10 dataset with a dropout value of 0.75, the accuracy may improve in some cases, depending on the number of epochs. Dropout does help increase the validation accuracy, but not very dramatically. The training accuracy, however, is generally better with no dropout at all, which is what you would expect from a regularizer that deliberately makes the training task harder.

I have used dropout in the past when working with text (Keras makes it much easier to use), and the end result is always very similar. To find a dropout value that actually improves performance and accuracy in a meaningful way, it is important to try out a variety of dropout values and numbers of epochs. The best setting also depends heavily on the subject matter and size of the dataset, so there is no single universal value. For binary text classification problems (e.g. sentiment analysis), though, a lower dropout value tends to produce better results.
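For reference, this is roughly what that looks like in Keras. The architecture, vocabulary size, and the dropout rate of 0.2 are placeholder choices for a generic sentiment classifier, not values from my own experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical binary sentiment model; vocab_size and the 0.2 rate
# are illustrative placeholders only.
vocab_size = 10000

model = keras.Sequential([
    layers.Embedding(vocab_size, 64),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),   # Keras' rate is the fraction of units to DROP
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

Note that Keras' Dropout layer takes the fraction of units to drop (not to keep) and is automatically disabled at inference time, so no extra test-time handling is needed.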

[Plot: training and validation accuracy on CIFAR-10, with and without dropout]

Note: Stanford has removed almost all assignments from the course website since they restarted the course. Luckily, I had saved some of them, but I also lost access to quite a large portion, which is why I'm still on Assignment 2.
