
Classes are (almost) over and I'm back!

I ran a Semantic Segmentation algorithm on Hotels 50K, and visualised the results here! Some quick notes:

  • For the purpose of this test I only used unoccluded images
  • I ran it locally on my computer for about 12 hours, and it only processed 1068 images, but it should be enough for now to see how it works.
  • I just got access to the servers, and my next step is to figure out how to use them properly and run the segmentation there.
  • This visualisation is slightly ugly (I'm not an HTML expert), so please forgive me for that.
  • The details about the algorithm can be found here
    • I had to modify the code, though, as it relies on CUDA, a parallel computing platform that runs on the GPU, and after two days of trying to set it up on my Mac I found out that Macs don't support it at all 🎉
    • That, however, made it run slower than it would have with CUDA support

As you can see from the visualisation, some images were segmented correctly. However, the algorithm failed for quite a lot of them:

  • Clear, high quality, and simple images were segmented correctly
  • The algorithm saw some bright light and shadows as objects
  • Since those images were taken by visitors, a lot of them are of very poor quality, and those didn't work well
  • In certain images with, for example, a crumpled blanket, the algorithm did not detect the blanket as one whole object but rather as many little pieces. Again, the shadows probably tricked the algorithm into seeing many little objects
  • Images that were flipped in one way or another were also segmented poorly -- a very major issue
  • Although objects were detected correctly in a lot of images, their assigned "category" was not always correct

 

During the past week or so I've been studying/reading/trying to implement dropout in a Convolutional Neural Network.

Dropout is a quite simple yet very powerful regularization technique. It was first officially introduced in this paper, and it essentially involves keeping each neuron active with some probability p and setting its output to zero otherwise. In other words, at each training iteration certain nodes are either dropped out of the net with probability 1-p or kept with probability p, and only the parameters of the resulting sampled ("thinned") network are updated for that input. Note that dropout zeroes out activations inside the network rather than values in the training data itself. It can be used for both text and image models, but it is never applied at test time. Its main end-goal is to reduce overfitting and therefore achieve greater accuracy on unseen data.
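
As a concrete illustration, here is a minimal sketch of "inverted" dropout applied to a layer's activations (numpy; the keep probability p and the array sizes are made up for the example, and this is the general idea rather than the paper's exact implementation):

```python
import numpy as np

def dropout_forward(activations, p=0.5, train=True):
    # Inverted dropout: keep each unit with probability p during training.
    # Scaling the kept units by 1/p at training time means nothing special
    # has to be done at test time, where the layer is left untouched.
    if not train:
        return activations  # dropout is never applied at test time
    mask = (np.random.rand(*activations.shape) < p) / p  # random keep/drop mask, rescaled
    return activations * mask

# Hypothetical usage on a batch of hidden-layer activations
h = np.random.randn(4, 10)
h_train = dropout_forward(h, p=0.75, train=True)  # roughly 25% of units zeroed out
h_test = dropout_forward(h, train=False)          # unchanged at test time
```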

The paper linked above had a great visualisation of how dropout can operate (see below). The image to the left shows a standard neural network with 2 hidden layers, and the image on the right shows the same network but with certain (random) neurons dropped.

 

Overall, as we can see from the image below, when training a neural network on the CIFAR-10 dataset with a dropout value of 0.75, the accuracy may improve in some cases, depending on the number of epochs. Dropout does help increase the validation accuracy, but not very dramatically. The training accuracy, however, generally seems to be better with no dropout at all.

I have used dropout in the past when working with text (Keras makes it much easier to use), and the end result is always very similar. To find the dropout value that actually improves performance and accuracy dramatically, it is very important to test out a variety of dropout values and numbers of epochs. All of this also depends heavily on the subject matter of the dataset and its size, so there is never one universal value. When working with binary text classification problems (e.g. sentiment analysis), though, lower dropout values tend to produce better results.
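
For reference, this is roughly how dropout slots into a small Keras text classifier; the layer sizes, vocabulary size, and the 0.2 rate below are illustrative placeholders rather than tuned values:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embedding_dim = 10000, 64   # placeholder values

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),                 # a fairly low rate, in line with the note above
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```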

 

Note: Stanford has removed almost all assignments from the course website now that a new run of the course has started. Luckily, I had saved some of them, but I also lost access to quite a large portion, which is why I'm still on Assignment 2.

For the past week or so I have been trying to work through Assignment 2 in Stanford's cs231n course on Neural Networks. The assignment focuses on the following aspects:

  • Understanding Neural Networks and how they are arranged.
  • Understanding and being able to implement backpropagation.
  • Implementing various update rules used to optimize Neural Networks.
  • Implementing Batch Normalization and Layer Normalization for training deep networks.
  • Implementing Dropout to regularize networks.
  • Understanding the architecture of Convolutional Neural Networks and getting practice with training these models on data.
  • Gaining experience with a major deep learning framework, such as TensorFlow or PyTorch.

I had worked on the last part in the previous blog post. This time, however, I explored how the different layers of a Neural Network are formed when it is, in fact, built from scratch. In short, I looked at ways to implement:

  • a forward pass
  • a backward pass
  • a forward pass for a ReLU activation function
  • a backward pass for a ReLU activation function

A pass moves data through the layers of the network. The forward pass of a fully-connected layer corresponds to one matrix multiplication followed by a bias offset and an activation function. The backward pass moves back through the network to compute the gradients used to update the weights.

It is interesting that the instructors chose the ReLU activation function, considering how many alternatives exist. I have, however, seen ReLU used a lot, and it seems to have proven to produce better results than many others. On the other hand, it can also be quite fragile: units can "die" during training if they stop activating. Thankfully, implementing ReLU simply involves calculating f(x) = max(0, x), which means the activation is simply thresholded at zero.
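
As a rough sketch (not the assignment's reference solution), the four pieces listed above can look something like this in numpy:

```python
import numpy as np

def affine_forward(x, w, b):
    # Fully-connected forward pass: flatten each input, then one matrix multiply plus bias.
    out = x.reshape(x.shape[0], -1) @ w + b
    return out, (x, w, b)

def affine_backward(dout, cache):
    x, w, b = cache
    dx = (dout @ w.T).reshape(x.shape)       # gradient w.r.t. the inputs
    dw = x.reshape(x.shape[0], -1).T @ dout  # gradient w.r.t. the weights
    db = dout.sum(axis=0)                    # gradient w.r.t. the bias
    return dx, dw, db

def relu_forward(x):
    return np.maximum(0, x), x               # threshold the activations at zero

def relu_backward(dout, cache):
    return dout * (cache > 0)                # pass gradients only where the input was positive
```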

I am still working my way through the assignment and will continue making posts as I make more progress.

As mentioned in the previous blog post about CS 231n, the course covers some of the most important aspects of CNNs for visual recognition (and, thankfully for me, starts from the basics).

The previous post went through a toy example of a simple two-layer neural network and how its performance differs from that of a linear softmax classifier, which was very helpful in understanding the basics of neural networks. The later lectures, modules, notes, and assignments of the course go into more advanced topics. I'm still trying to figure out exactly what is happening in every step described in the tutorials, but given the density of the information I try to focus on the topics that seem most applicable and useful for the future. One such topic is PyTorch, a system for executing dynamic computational graphs over Tensor objects that behave similarly to numpy ndarrays. In short, it eliminates some of the steps we would otherwise have to complete manually. Being unfamiliar with this library, I chose to jot some things down about it in this blog post (and I might go further in depth into Tensorflow in the next post). PyTorch is a library I hadn't heard of before starting to work in the lab, so pretty much everything about it is new to me.

Here are some things I noted about PyTorch:

  • When using a framework like PyTorch you can harness the power of the GPU for your own custom neural network architectures.
  • When creating a neural network from scratch you have to download the dataset, preprocess it, and iterate through it in minibatches. PyTorch, similarly to Tensorflow, automates and therefore simplifies this process.
    • This is usually done through the torchvision package.
  • For a simple fully-connected ReLU network with 2 hidden layers and no biases, PyTorch lets you compute the forward pass using operations on PyTorch Tensors and uses PyTorch Autograd to compute the gradients.
    • A PyTorch Tensor is similar to a numpy array: it is an n-dimensional grid of numbers.
    • The input tensor needs to be flattened to 2 dimensions before being passed to the network.
  • A 2-layer neural network can be trained in PyTorch in far fewer lines of code than when created from scratch! (See the sketch below.)
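
As a rough illustration (scaled down to a single hidden layer to keep it short, with made-up dimensions), this is the kind of pattern described above: a hand-written forward pass on Tensors, with Autograd doing the backward pass:

```python
import torch

# Made-up sizes: batch of 64, 1000 input features, 100 hidden units, 10 outputs.
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Weights only (no biases); requires_grad=True tells Autograd to track them.
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for step in range(500):
    # Forward pass written by hand with Tensor operations: matmul -> ReLU -> matmul.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: Autograd fills in w1.grad and w2.grad for us.
    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
```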


Stanford's 231n course, which covers some of the most important parts of using CNNs for Visual Recognition, sure provides an extensive amount of information on the subject matter. Although I'm only slowly going through the course, here are some interesting things I've learned so far (they are probably very basic for you, but I'm quite ~thrilled~ to be learning all of this):

  • SVM and Softmax are very comparable classifiers, and both can perform quite well on a variety of classification tasks. However, Softmax is easier to interpret, as it outputs a probability for each class in a given image rather than the arbitrary scores that SVM produces. Although those arbitrary scores also allow for quite simple evaluation, probabilities are generally preferred.
  • However, even a small change from SVM/Softmax to a simple 2-layer NN can improve results dramatically.
  • The loss function essentially computes the "unhappiness" with the final results of the classification. Triplet Loss is an example of such a function, but the course (or at least its beginning) focuses on hinge loss.
    • Hinge Loss is mostly used for SVMs.
    • Hinge Loss adds a penalty of max(0, s_j - s_correct + Δ) for each incorrect class j, where Δ acts as a margin.
    • Sometimes the squared Hinge Loss is used, which penalises predictions more strictly/strongly.

A part of my learning came from studying and running the "toy example" they explain in the course. Here are the steps I took to work through the example (they had some sample code, but I adapted it and changed things around a little):

  1. Generate a sample "swirl" (spiral) dataset that would look like this and would consist of 3 classes -- red, blue, and yellow (a short code sketch for generating it is included after this section):
  2. Train a Softmax classifier.
    1. 300 2-D points (--> 300 rows of scores, with 3 scores each -- one per colour)
    2. Calculate cross-entropy loss
      • Compute probabilities
      • Compute the analytic gradient with backpropagation to minimise the cost
      • Adjust parameters to decrease the loss
    3. Calculate & evaluate the accuracy on the training set, which turns out to be only 53%
  3. Train a 2-layer neural network
    1. Two sets of weights and biases (for the first and second layers)
      • size of the hidden layer (H) is 100
      • the only change from before is one extra line of code, where we first compute the hidden layer representation and then the scores based on this hidden layer.
    2. The forward pass to compute scores
      • 2 layers in the NN
      • use the ReLU activation function -- we've added a non-linearity with ReLU that thresholds the activations of the hidden layer at zero
    3. Backpropagate all layers
    4. Calculate & evaluate the accuracy on the training set, which turns out to be 98% -- yay!
  4. Results
    • Softmax
    • 2-layer Neural Network

Please find my iPython notebook with all the notes and results on GitHub.
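
For step 1, the spiral data can be generated along these lines (a sketch close to the course's toy example; the sizes and noise level below are illustrative, chosen to give 300 2-D points in 3 classes):

```python
import numpy as np

N, D, K = 100, 2, 3                  # points per class, dimensionality, number of classes
X = np.zeros((N * K, D))             # data matrix: each row is a 2-D point
y = np.zeros(N * K, dtype='uint8')   # class labels (0 = red, 1 = blue, 2 = yellow)

for j in range(K):
    ix = range(N * j, N * (j + 1))
    r = np.linspace(0.0, 1, N)                                          # radius
    t = np.linspace(j * 4, (j + 1) * 4, N) + np.random.randn(N) * 0.2   # theta, with noise
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j
```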


Step-by-step description of the process:

  • Load the data in the following format

  • Create an ID for each hotel based on hotel name for training purposes
  • Remove hotels without names and keep hotels with at least 50 reviews
  • Load GloVe
  • Create sequences of embeddings and set the maximum sentence length to 100

  • Select anchors, positives, and negatives for the triplet loss. The anchor and positive have to be reviews from the same hotel, whereas the negative has to be a review from a different hotel
  • ||f(A) - f(P)||^2 <= ||f(A) - f(N)||^2, where A = anchor, P = positive, N = negative; equivalently, d(A,P) <= d(A,N)
  • LOSS = max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + α, 0)
  • COST = sum of the losses over the training set of triplets
  • Train a model with the triplet loss for 5 epochs
  • Plot training and validation loss

  • Training for more epochs could potentially reduce the loss further
  • However, based on the test results we can see that the triplet loss doesn't produce any valuable results. Especially since we are looking at just one location, it is easier for reviews to cluster based on the things they mention rather than on the hotels they come from
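
Written as code, the loss above boils down to something like this (a TensorFlow sketch, not my exact training code; anchor, positive, and negative are batches of review embeddings f(A), f(P), f(N), and the 0.2 default for the margin alpha is just a placeholder):

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # ||f(A) - f(P)||^2 and ||f(A) - f(N)||^2, computed per example in the batch
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # LOSS = max(pos_dist - neg_dist + alpha, 0), averaged over the batch to get the cost
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```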