After our less-than-stellar results last week, we talked to Dr. Pless and decided to pivot the goal of our project.
Instead, our new model will be designed to take two images from the same traffic camera and output the time that has passed between the two frames (e.g., if one is taken at 12:30:34 and the next at 12:30:36, we should get an output of 2 seconds).
The reasoning behind this is that, in order to distinguish time differences, the network must learn to recognize the moving objects in the images (i.e., cars, trucks, etc.). This way, we force it to learn what vehicles look like, keep track of colors, keep track of vehicle sizes, and so on, without our having to label every single one individually.
In order to accomplish this, we need to learn about embeddings, so that we can map each image to a feature vector; the distance between two such vectors should then represent the similarity between the images. This can then be used to train the network to actually detect time differences.
What we know about embeddings
We know that an embedding maps an image to a vector in a high-dimensional space, where the vector represents different features of the image. Similar images map to nearby points, so we can use the distance between vectors to gauge similarity.
We found some examples of embedding networks online to learn a bit about how they work. One example used ResNet50, pretrained on ImageNet. To get the embedding vector, it passes an image through the network up to the 'avg_pool' layer; the output of that layer is taken as the embedding.
We understand that, because this net is trained on image classification, it must learn some features of the image, so taking an intermediate layer's output should give us the 'high-dimensional space vector'.
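As a concrete sketch of what this looks like in Keras (the layer and argument names come from the standard keras.applications API; we use `weights=None` here just to avoid the ImageNet download, whereas the real example used the pretrained weights):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# pooling='avg' with include_top=False cuts the network off at the
# global-average-pool step, so the model's output IS the 2048-d
# 'avg_pool' embedding instead of class scores.
model = ResNet50(weights=None, include_top=False, pooling="avg")

# A dummy 224x224 RGB image standing in for a traffic-camera frame.
image = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0
embedding = model.predict(preprocess_input(image), verbose=0)

print(embedding.shape)  # (1, 2048) -- the 'high-dimensional space vector'
```

Two images can then be compared by, e.g., the Euclidean distance between their 2048-d embeddings.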
What we don't understand is: what do we train our embedding network on? It seems there must be some initial task that produces a network whose weights relate to the objects in the image. Our final task will be to get the time difference between two images, but I don't believe we can train our network on this task from the start. If we did try this, and succeeded in training a net that simply takes two images as input, then we wouldn't need the embedding (maybe we would still use it for visualization?). But we believe we need some initial task that makes the network learn the features of our images in some way first. Then we can take an intermediate layer of this network as the embedding, pass that vector to a second network, and have that second network output the time difference.
We also gathered some images from a higher-framerate camera (we gathered at ~3 images/second). We needed these instead of the AMOS cameras because we need to detect smaller-scale time differences; 30 minutes would be way too long, and any cars in the image would be long gone.
I'm really excited that you guys are going to be exploring this direction -- I think this question of whether the network learns the transient objects is super cool.
Re: "what do we train our embedding network on?"
The idea in training an embedding is to have some loss function which encourages the output features for two images from the same class to be very similar and those for two images from different classes to be very different. This is different from the sort of classification approach you're describing, where the output would be "these images are 2 seconds apart". [It's possible that Robert actually did want you to do the classification version of this, I suppose, but I'm guessing not -- Robert, if that's wrong, feel free to interject! 🙂 ]
There are different loss functions that people use to learn an embedding, but one of the simplest is triplet loss, which takes as input triplets of images: an anchor example, a positive example and a negative example. The anchor and the positive image are from the same class (in your case, a few seconds apart), and the negative example is from a different class (more than a few seconds apart).
The loss function is then very different from that of a classification network, which predicts class likelihoods; instead, it is computed entirely on these feature vectors:
loss_per_triplet = max(||anchor_feature - positive_feature|| - ||anchor_feature - negative_feature|| + margin, 0)
This loss function says: I want my two examples that are a few seconds apart to have more similar features than the anchor and negative examples, by some margin. If that criterion is met, then the triplet will not contribute to the loss. This encourages the network to push examples from the same class closer together and examples from different classes further apart.
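Written out in code (a minimal NumPy sketch using Euclidean distance; the function name and default margin are our choices, not from any particular library):

```python
import numpy as np

def triplet_loss(anchor_feature, positive_feature, negative_feature, margin=0.2):
    """Hinge-style triplet loss on three embedding vectors."""
    d_pos = np.linalg.norm(anchor_feature - positive_feature)
    d_neg = np.linalg.norm(anchor_feature - negative_feature)
    # Loss is zero once the positive is closer than the negative by the margin.
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor: a few seconds apart
n = np.array([1.0, 1.0])   # far from the anchor: a different "class"

print(triplet_loss(a, p, n))  # 0.0 -- the margin is already satisfied
print(triplet_loss(a, n, p))  # positive/negative swapped, so the loss is > 0
```

In practice you would compute this over a batch of triplets and average, and the gradients flow back through the network that produced the feature vectors.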
The hope here is that in order to produce features that are similar for the same scene a few seconds apart but different at the scale of ~20 seconds to a minute, the network will need to learn an encoding of the transient objects in the scene. (If, on the other hand, we were to increase the time difference between the anchor and the negative examples to several hours, the network might instead learn to encode something about the shadows in the scene.)
This set of notes gives a really good intro to triplet loss and different strategies for implementing it: https://omoindrot.github.io/triplet-loss
You should also feel free to talk to Hong about embedding and other loss functions, or questions about implementations, as this is kind of his wheelhouse.
Two thoughts:
1. I didn't think that your results were bad! I think they highlight that a lot of traffic cameras end up with pretty small images of individual cars and that we need to find some way to train something to fix up each camera. And
2. I think that one cheap way to train things per camera would be to do this trick that gets them to automatically learn about similar objects.
Abby did go through what I had in mind. I think it is useful to hear many versions of the same story, so (without re-looking at what Abby said) I'm going to give my thoughts:
You suggested that we could perhaps make something that predicts, from a pair of images, how far apart in time they are. I want to do the easier task of just saying "very close by" in time or "farther apart". But the easiest way to do this is probably what Abby suggested, which is to train a network with triplet loss. Abby shared the specific loss function, so I'll share instead the idea.
You are going to make triplets of images (a, p, n)
("anchor", "positive", "negative"),
where the anchor and positive are, perhaps 3 seconds apart, and the anchor and negative are, say, 1 minute apart.
You are then going to train a network to produce features for each image
f(a), f(p), f(n) (or, in Abby's terminology: anchor_feature, positive_feature, negative_feature),
penalizing the network if the positive feature is farther from the anchor than the negative feature is.
That motivates the loss function Abby talked about.
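One way to turn a list of timestamped frames into (a, p, n) triplets with those gaps (the 3-second and 1-minute thresholds are the ones above; the frame list and filename format are made-up stand-ins for whatever you actually capture):

```python
import random

# (timestamp_in_seconds, filename) pairs, sorted by time -- stand-in data
# for 5 minutes of frames captured at 1 frame/second.
frames = [(t, f"frame_{t:05d}.jpg") for t in range(0, 300)]

def make_triplet(frames, pos_gap=3, neg_gap=60):
    """Pick a random anchor, a positive within pos_gap seconds of it,
    and a negative at least neg_gap seconds away."""
    t_a, a = frames[random.randrange(len(frames))]
    positives = [f for t, f in frames if 0 < abs(t - t_a) <= pos_gap]
    negatives = [f for t, f in frames if abs(t - t_a) >= neg_gap]
    if not positives or not negatives:
        return None  # anchor too close to the ends of the recording
    return a, random.choice(positives), random.choice(negatives)

triplet = make_triplet(frames)
if triplet is not None:
    print(triplet)  # e.g. an anchor, a frame ~3s away, a frame >=60s away
```

You would sample many such triplets per training batch, load the three images, and feed them through the shared embedding network.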
I encourage you to use "ResNet-50" as your network architecture because we use that quite a lot.
One key step is hunting for datasets. You should explore these below and then make a suggestion about which you think might be good and why:
Datasets:
Possible dataset 1:
This has an absurd number of half-hour videos, captured at about 1fps.
http://lost.cse.wustl.edu/
If you get to the point of needing it, the username/password is: "collaborator", "bandwidth" (no quotes in what you type).
Possible dataset 2:
Pick something from here. Use a YouTube scraper to get the video into some normal movie format, then either write your code to work with that or split it up into individual images.
https://www.youtube.com/results?sp=EgIYAg%253D%253D&search_query=%22traffic+camera%22