After our less-than-stellar results last week, we talked to Dr. Pless and decided to pivot the goal of our project.
Instead, our new model will be designed to take 2 images from the same traffic camera and output the time that has passed between the 2 frames (e.g., if one is taken at 12:30:34 and the next at 12:30:36, we should get an output of 2 seconds).
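Just to make the training label concrete, here's a tiny Python sketch (using the made-up timestamps from the example above) of the value the network should predict:

```python
from datetime import datetime

# Hypothetical sketch: turn two frame timestamps into the label the network should predict.
t1 = datetime.strptime("12:30:34", "%H:%M:%S")
t2 = datetime.strptime("12:30:36", "%H:%M:%S")

label_seconds = (t2 - t1).total_seconds()
print(label_seconds)  # 2.0
```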
The reasoning behind this is that, in order to distinguish time differences, the network must learn to recognize the moving objects in the images (i.e., cars, trucks, etc.). This way, we force it to learn what vehicles look like and to keep track of their colors, sizes, etc., without us having to label every single one individually.
In order to accomplish this, we need to learn about embeddings, so that we can create 2 feature vectors whose distance reflects the similarity between the images. These vectors can then be used to train the network to actually detect time differences.
What we know about embeddings
We know that an embedding maps an image to a vector in a high-dimensional space, where the dimensions represent different features of the image. Similar images will map to nearby points, so we can use the distance between vectors to gauge similarity.
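One common way to gauge that similarity is cosine similarity between the two vectors. Here's a minimal sketch with made-up 4-d vectors (real embeddings are much larger), just to show how two embeddings would be compared:

```python
import numpy as np

def cosine_similarity(a, b):
    # Values close to 1 mean the two embedding vectors point in similar
    # directions, which we interpret as the two images being similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 4-d "embeddings" just to illustrate the comparison.
v1 = np.array([0.9, 0.1, 0.3, 0.7])
v2 = np.array([0.8, 0.2, 0.4, 0.6])
print(cosine_similarity(v1, v2))
```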
We found some examples of embedding networks online to learn a bit about how they work. One example used ResNet50, pretrained on ImageNet. To get the embedding vector, it passes an image through the network up to the 'avg_pool' layer and treats that layer's output as the embedding.
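For reference, that extraction step looks roughly like this in Keras (we're assuming the tf.keras version of ResNet50 here, and the filename is just a placeholder):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Load ResNet50 pretrained on ImageNet, then cut it off at the 'avg_pool' layer.
base = ResNet50(weights='imagenet')
embedder = Model(inputs=base.input, outputs=base.get_layer('avg_pool').output)

def embed(img_path):
    # Returns the 2048-d embedding for a single image file.
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return embedder.predict(x)[0]

vec = embed('traffic_frame.jpg')  # placeholder filename
print(vec.shape)
```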
We understand that, because this net is trained for image classification, it has to learn features of the image along the way, so grabbing an intermediate layer's output should give us that 'high-dimensional space vector'.
What we don't understand is: what do we train our embedding network on? It seems there needs to be some initial task that trains a network whose weights come to relate to the objects in the image. Our final task will be to get the time difference between 2 images, but I don't believe we can train our network on that task from the start. If we did try this, and were successful in training a net that just takes 2 images as input, then we wouldn't need the embedding at all (maybe we would still use it for visualization?). But we believe we need some initial task that trains a network on our images and makes it learn useful features first. Then we can take some intermediate layer of that network as the embedding, and pass the resulting vectors to another network whose output is the time difference.
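To make that two-stage idea concrete, here's a hypothetical sketch (not something we've built or trained yet) of a shared embedding branch feeding a small head that regresses the time difference. The layer sizes and the choice of a pretrained ResNet50 as the branch are just assumptions for illustration:

```python
from tensorflow.keras import layers, Model, Input
from tensorflow.keras.applications.resnet50 import ResNet50

# Shared embedding branch: pretrained ResNet50 with global average pooling.
base = ResNet50(weights='imagenet', include_top=False, pooling='avg')

img_a = Input(shape=(224, 224, 3))
img_b = Input(shape=(224, 224, 3))

emb_a = base(img_a)  # 2048-d embedding of the first frame
emb_b = base(img_b)  # 2048-d embedding of the second frame

# Small head that looks at both embeddings and regresses the time difference.
merged = layers.Concatenate()([emb_a, emb_b])
hidden = layers.Dense(256, activation='relu')(merged)
time_diff = layers.Dense(1, name='time_diff')(hidden)

model = Model(inputs=[img_a, img_b], outputs=time_diff)
model.compile(optimizer='adam', loss='mse')
model.summary()
```

Because both inputs go through the same branch, the two embeddings live in the same space, which is what makes comparing them meaningful.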
We also gathered some images from a higher-framerate camera (roughly 3 images/second). We needed these rather than the AMOS cameras because we need to detect time differences at a much smaller scale; 30 minutes between frames would be way too long, and any cars in the image would be long gone.