
I’ve done lots of work on “embedding” over the last 20 years.  In the early 2000s this was called “manifold learning,” and the goal was to do something like t-SNE:

Given a collection of images (or other high-D points) --- that you *think* have some underlying structure or relationship that you can define with a few parameters --- map those images into a low-dimensional space that highlights this structure.

Some of my papers in this domain are here.  

https://www2.seas.gwu.edu/~pless/publications.php?search=isomap

This includes papers that promote using this for temporal super-resolution, papers that use that structure to help segment biomedical imagery, and modifications to these algorithms for when the low-dimensional structure is cyclic (for example, if the images are of an object that has rotated all the way around, you want to map them to a circle, not a line).

Algorithms like Isomap, LLE, t-SNE and UMAP all follow this model.  They take as input a set of images and map those images to a 2 or 3 dimensional space where the structure of the points is, hopefully, apparent and useful.  These algorithms are interesting, but they *don’t* provide a mapping from any image into a low-dimensional space, they all just map the specific images you give them into a low dimensional space.  It is often awkward to map new images to the low dimensional space (what you do is find the closest original image or two close original images and use their mapping and hope for the best).

These algorithms also assume that for very nearby points, simple ways of comparing points are effective (for example, just summing the difference of pixel values), and they try to understand the overall structure or relationship between points based on these trusted small distances.  They are often able to do clustering, but they don’t expect to have any labelled input points.

What I learned from working in this space is the following:

  1. If you have enough datapoints, you can trust small distances.  (Yes, this is part of the basic assumption of these algorithms, but most people don’t really internalize how important it is).
  2. You *can’t possibly* have enough datapoints if the underlying structure has more than a few causes of variation (because you kinda need to have an example of how the images vary due to each cause and you get to an exponential number of images you’d need very quickly).
  3. You can make a lot of progress if you understand *why* you are making the low-dimensional embedding, by tuning your algorithm to the application domain.

What does this tell us about our current embedding work?  The current definition of embedding network has a different goal:  Learn a Deep Learning network f that maps images onto points in an embedding space so that:

|f(a) - f(b)| is small if a,b are from the same category

and

|f(a) - f(b)| is large if a,b are from different categories.
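To make that concrete, here is a minimal sketch of one standard way to encode those two constraints, a triplet loss in PyTorch. The function name and margin value are placeholders for illustration, not a specific training setup from our papers.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """One common way to encode the two constraints above:
    pull |f(a) - f(p)| down when a, p are from the same category,
    push |f(a) - f(n)| up when a, n are from different categories."""
    d_pos = F.pairwise_distance(f_a, f_p)   # same category: want small
    d_neg = F.pairwise_distance(f_a, f_n)   # different category: want large
    return F.relu(d_pos - d_neg + margin).mean()

# usage: f_a, f_p, f_n are the embeddings produced by the network f
# for an anchor, a same-category image, and a different-category image.
```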

This differs from the previous world in three fundamental ways:

  1. We are given labels that help tell us where our points should map to (or at least constraints on where they map), and
  2. We care about f(c) ... where do new points map?
  3. Usually, we don’t expect the original points to be close (in terms of pixel differences or other *easy* distance functions in the high-dimensional space).

So we get extra information (in the form of labels of which different images should be mapped close to each other), at the cost of our input images maybe not being so close in their original space, and requiring that the mapping work for new and different images.

Shit, that’s a bad trade.

So we don’t have enough training images to be able to trace out clean subsets of similar images by following chains of super-similar images (you’d have to sample cars of many colors, in front of all backgrounds, with all poses, and all combinations of which doors and trunks are open).

So what can you do?

Our approach (Easy-Positive) is something like “hope for the best”.  We force the mapping to push the closest images of any category towards each other, and don’t ask the mapping to do more than that.  Hopefully, this allows the mapping to find ways to push the “kinda close” images together. Hopefully, new data from new classes are also mapped close together.
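As a rough sketch of that idea (simplified, not our exact implementation), the mining step inside a batch might look like the following, assuming `embeddings` and integer `labels` tensors. It returns, for each image in the batch, the index of its closest same-class neighbor; a triplet-style loss would then only pull those pairs together.

```python
import torch
import torch.nn.functional as F

def easy_positive_indices(embeddings, labels):
    """For each anchor, find its *closest* same-class example in the batch
    (the 'easy positive'); the loss only pulls that pair together.
    A sketch of the idea, not the exact Easy-Positive code."""
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                          # cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    same.fill_diagonal_(False)                   # exclude self-matches
    sim_pos = sim.masked_fill(~same, float('-inf'))
    return sim_pos.argmax(dim=1)                 # index of closest positive
```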

What are other approaches?  Here is a paper, just published to ArXiv, that tries something: https://arxiv.org/pdf/1904.03436.pdf.  They say, perhaps there are data augmentation steps (or ways of changing the images) that you *really* don’t believe should change your output vector.  These might include slight changes in image scale, or flipping the image left/right, or slight skews of the image.

You could take an *enormous* number of random images, and say “any two images that are a slight modification of each other should have exactly the same feature vector,” and “any two images that are different should have different feature vectors.”

This could be *added* to the regular set of images and triplet loss used to train a network and it would help force that network to ignore the changes that are in your data augmentation set.  If you care most about geometric transformations, and really like formal math, you could read the following (https://arxiv.org/pdf/1904.00993.pdf).
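A hedged sketch of how such invariance pairs could be generated with torchvision transforms; the particular flip/scale/skew settings here are made up for illustration.

```python
import torchvision.transforms as T

# Transformations we *really* don't believe should change the output vector
# (illustrative settings: small scale changes, horizontal flips, slight skews).
invariance_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=0, scale=(0.95, 1.05), shear=5),
])

def make_invariance_pair(image):
    """Return (image, slightly-modified image); downstream, these two views
    are pushed to the *same* embedding, alongside the usual triplet loss."""
    return image, invariance_aug(image)
```

These (image, augmented image) pairs can then be mixed into the regular training batches as must-match pairs alongside the normal triplet loss.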

The other tool we can use to try to dig deeper into the problem domain is to ask more "scientific" questions.  Some of our recent tools really help with this.

Question: When we “generalize” to new categories, does the network focus on the right part of the image?

How to answer this:

  1. Train a network on the training set.  What parts of the image are most important for the test set?
  2. Train a network on the TEST set.  What parts of the image are most important for the test set?
  3. We can do this and automatically compare the results.
  4. If the SAME parts of the image are important, then the network is generalizing in the sense that the features are computed on the correct part of the image.

This might be called “attentional generalization”.
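One simple way to "automatically compare the results" in step 3, assuming we already have spatial importance maps (e.g., Grad-CAM-style heatmaps) from the two networks for the same test image, is a normalized correlation between them. This is a sketch of the comparison, not our exact tool.

```python
import numpy as np

def attention_agreement(heatmap_train_net, heatmap_test_net):
    """Compare where two networks 'look' on the same test image.
    Inputs are spatial importance maps of the same shape.
    Returns a correlation in [-1, 1]; high values mean the network trained
    on the training set attends to the same regions as a network trained
    on the test set."""
    a = heatmap_train_net.ravel().astype(float)
    b = heatmap_test_net.ravel().astype(float)
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.mean(a * b))
```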

Question: If the network does focus on the right part of the image, does it represent objects in a generalizable way?

How to answer this:
Something similar to the above, with a focus on image regions instead of the activation across the whole image.

This we might call “representational generalization" or "semantic generalization”

I finally have clean data again! I found out what was wrong with my image-capturing pipeline. When screenshotting your screen, the monitor will still "shut off" according to its power-saving settings, and your captures go blank with it. So if you are screenshotting for a long stretch of time and have a power-saving timer on your monitor, turn it off!

I now have a full pipeline from

capturing images -> training model -> live stream prediction.

For next week, I will be focusing on capturing additional information from the image, such as the resources, time, and population of a civilization. For example:

I am deciding between training an ML network to recognize the numbers and parsing the font files to read the screen with a non-ML approach!
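As a sketch of what the non-ML route might look like (the template image paths are hypothetical, rendered from the game font), OpenCV template matching over a cropped digit patch would be roughly:

```python
import cv2

# Hypothetical: templates/0.png ... templates/9.png rendered from the game font.
templates = {d: cv2.imread(f"templates/{d}.png", cv2.IMREAD_GRAYSCALE)
             for d in range(10)}

def read_digit(patch_gray):
    """Match one cropped digit patch (grayscale, at least template-sized)
    against the ten rendered templates and return the best-scoring digit.
    A sketch of the non-ML option, not a tested reader."""
    scores = {}
    for digit, tmpl in templates.items():
        res = cv2.matchTemplate(patch_gray, tmpl, cv2.TM_CCOEFF_NORMED)
        scores[digit] = res.max()
    return max(scores, key=scores.get)
```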

[This blog post is going to be short...MongoDB has taken over my life]

Last week, our trusty Nikon camera broke...but at last, our new camera (finally) arrived yesterday! Derek and I have tested the camera-capture pipeline using this new camera, and it seems to be a seamless transition from the old camera to the new one. I have taken a few pictures of the glitter using the camera:

I am displaying a scan line to the glitter here; you can kind of make out the reflection of the scan line on the glitter sheet. I will have to play around with the shutter speed and ISO in order to avoid having such a reflection in my images. The most exciting part about all of this, though, is that I CAN FOCUS THE CAMERA (for those of you who know about my inability to focus the old Nikon, you know how big of a deal this is)!!

I have not had a chance to do any actual processing of these images yet, so look for more exciting updates next week! These include:

  • testing different sizes of scan lines and different gaussians for the scan lines to determine the best one to use
  • playing with the parameters of the camera in order to get the cleanest pictures possible
  • starting to work on and test centroid-detection code
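For the centroid-detection item, a first-pass sketch using OpenCV connected components might look like this; the threshold and minimum blob size are guesses that would need tuning once the camera settings (shutter speed, ISO) are dialed in.

```python
import cv2

def find_glitter_centroids(gray_image, thresh=200, min_area=3):
    """Threshold the bright glitter specks and return their centroids.
    A first-pass sketch, not the final detection code."""
    _, binary = cv2.threshold(gray_image, thresh, 255, cv2.THRESH_BINARY)
    num, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
    # Component 0 is the background; keep the rest if they are big enough.
    keep = [i for i in range(1, num) if stats[i, cv2.CC_STAT_AREA] >= min_area]
    return centroids[keep]
```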

Tuesday and Wednesday I sat in a building in Ballston, VA (just a few metro stops away) for the "Phase 2 kickoff meeting" for the Geo-spatial Cloud Analytics program.  James is leading our efforts on this project, and our goal is to use nearly-real-time satellite data to provide updated traversability maps in areas of natural disasters (think "Google Maps routing directions that don't send you through flood waters").

Special note 1: Our very own Kyle Rood is going to be interning there this summer, working on this project!

Many of the leaders of teams for this project "grew up" at the same time I did ... when projects worked with thousands of points and thousands of lines of Matlab code.  Everyone is now enamored with Deep Learning... and shares stories about how their high-school summer students do amazing things using the "easy version" of Deep Learning, mostly with off-the-shelf tools.

The easy version of Deep Learning relies on absurd amounts of data, and this project (which considers satellite data) has absurdly absurd amounts of data.  In the Houston area, we found free LIDAR from 2018 that, when rendered, looks like this:

and zooming in:

This scale of data (about 15cm resolution LIDAR) is available for a swath that cuts across Houston.

This data is amazing!  And there is so much of it.  Our project plans to look at flooding imagery and characterize which roads are passable.

We've always known the timeline of this project was very short; but we have an explicit demo deadline of November (which means we need to deliver our code to DZyne much sooner).  So:
(a) we may consider first run options that do less learning and rely more on high-resolution or different data types to start, and

(b) the satellite data is *amazing* and we should think about which of our algorithms might be effective on these datatypes as well.

First of all, I haven't visualized my data, so this is a no-figure post.

Over the last few days, I was trying to find how the KL distance between the high-dimensional embeddings affects lambda selection, and I just ran several experiments.

I got three original t-SNE and nine yoked t-SNE embeddings (for different lambdas) for the following data:

  • Proxy to NPAIRS, on TEST data, for CARS
  • Proxy to NPAIRS, on TEST data, for CUB
  • Proxy to NPAIRS, on TRAIN data, for CARS
  • Proxy to NPAIRS, on TRAIN data, for CUB

I haven't gotten the results yet, so I don't have a conclusion right now.

Second, I modified the t-SNE module in sklearn. Now we can use Barnes-Hut t-SNE in our yoked method. The time cost for 8000 points decreases from 30 minutes to 10 minutes.
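For reference, this is what the standard (unmodified) sklearn call looks like; our yoked objective lives in the modified copy of the module, but the speedup comes from the same Barnes-Hut switch. The data here is just a random stand-in.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(8000, 64)   # stand-in for 8000 high-dimensional embeddings

# 'barnes_hut' is the O(N log N) approximation; 'exact' is the slower O(N^2) option.
Y = TSNE(n_components=2, method='barnes_hut', perplexity=30).fit_transform(X)
```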

Data we have:

X: Scans from different days and different plots

Y1 (labels): Date-plot labels

Y2 (features): Small fraction of hand-measured data (distributed unevenly across dates)

Target:

Predict the features (leaf length, canopy cover, plant count, etc.) for each plot (or scan image).

Brief Structure:

  1. Use scan images and date-plot labels to train a CNN that embeds the images into a high-dimensional space.
  2. Use the hand-measured features to find a (transformed) subspace of the embedded space that can describe each feature.
  3. Use that space to interpolate the unmeasured images.
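A minimal sketch of steps 2 and 3 under the simplest (linear) assumption, with stand-in arrays for the embeddings and the hand-measured feature; the shapes and names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical shapes: Z is the CNN embedding of every scan, y2 is one
# hand-measured feature (e.g., leaf length) for the small labeled subset.
Z = np.random.rand(5000, 128)                      # stand-in embeddings
labeled_idx = np.random.choice(5000, 200, replace=False)
y2 = np.random.rand(200)                           # stand-in measurements

# Step 2: fit a (linear) map from the embedded space to the feature.
reg = Ridge(alpha=1.0).fit(Z[labeled_idx], y2)

# Step 3: interpolate the feature for every unmeasured scan.
y2_pred = reg.predict(Z)
```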

Assumption

Hidden features: With some explicit labels, the CNN can somehow learn implicit features; i.e., using the embedded space with few or no transformations, we can have a mapping from that space to a linear feature space.

Ideas

  1. Embedding part:
  • The label is continuous instead of class-like. Does embedding a dataset like this with triplet or n-pair loss seem unreasonable? I think we need a criterion that is related to continuous labels.
  • More criteria: Since we have more features than just date and plot, is it possible to use more criteria to minimize?
  • Single image as a class: Non-parametric instance-level discrimination. Maybe it could find some interesting structure in the dataset.
  2. If no transformation is needed:
  • Find the dimension of the embedded space that most depends on the feature.
  3. If the transformation is linear:
  • Linear regression with data points in the embedded space as input and the features as targets
  • PCA
  4. If the transformation is non-linear:
  • k-NN regression (not working)

Doing kNN regression runs into the curse of dimensionality. For a single point trying to acquire a feature in the embedded space, we need a large k, or it always tends to just predict from the nearest cluster. But we don’t have enough y2 to do so.
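For comparison, the k-NN regression that isn't working is roughly the following (same kind of stand-in data as the sketch above); the n_neighbors value is arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

Z = np.random.rand(5000, 128)                      # stand-in embeddings
labeled_idx = np.random.choice(5000, 200, replace=False)
y2 = np.random.rand(200)                           # stand-in hand measurements

# With so few labeled points in a high-dimensional space, small k just copies
# whatever labeled cluster is nearest, and large k averages over unrelated scans.
knn = KNeighborsRegressor(n_neighbors=15).fit(Z[labeled_idx], y2)
y2_knn = knn.predict(Z)
```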

  • More layers for different features:

Concatenate some small layers for a single feature (inspired by R-CNN)

  • NICE:

Using NICE on top of the embedded space, then finding the directions that best describe the features.

An issue we raised in the paper discussion is that NICE or GLOW focuses on some fixed dimensions: the images used to demonstrate the GLOW model are all well aligned (the parts of the face (eyes, nose, etc.) always appear in approximately the same region of the image, i.e., the same dimensions of the vector). But in an embedded space, each dimension should already have some fixed meaning. So, is it possible to use this kind of model to map the embedded space into another space that has independent and continuous dimensions?

  • Supervised Manifold learning:

Most manifold learning methods are unsupervised. But is it possible to add some constraints? For example, find a 2D manifold of a 3D sphere whose x-axis is based on latitude.
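A toy version of that sphere example, assuming Isomap as the unsupervised baseline: sample points on a sphere, embed them in 2D, and check how well latitude lines up with one axis, which is the kind of relationship a supervised constraint would enforce directly rather than hope for.

```python
import numpy as np
from sklearn.manifold import Isomap

# Sample points on a 3D sphere and record their latitude, the quantity
# we'd like one embedding axis to follow.
rng = np.random.default_rng(0)
theta = rng.uniform(0, np.pi, 2000)        # polar angle
phi = rng.uniform(0, 2 * np.pi, 2000)
X = np.c_[np.sin(theta) * np.cos(phi),
          np.sin(theta) * np.sin(phi),
          np.cos(theta)]
latitude = np.pi / 2 - theta

# Unsupervised 2D embedding; a supervised variant would *constrain* an axis
# to track latitude instead of just checking afterwards.
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
corr = np.corrcoef(Y[:, 0], latitude)[0, 1]
print(f"correlation of first Isomap axis with latitude: {corr:.2f}")
```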


Leaf length/width

The leaf length/width pipeline was completed for season 6, and I got results at the leaf scale. Some statistical summaries (mean and percentiles) for some random plots:

The code I ran for the Slack post did have a typo, so I regenerated these with more percentiles.

The results at the start of the season seem reasonable, but toward the end of the season the leaves get smaller in the results. The higher percentiles do have some good effect on finding the 'complete leaves', but they also add more noise sometimes in the early season. And it's still not working for some days.

So I just randomly cropped 4 images from the late season:

Top left, for example: the scanner can only scan one side of the leaves, and most of the leaves are too curved to see the whole leaf, which means the connected-component approach cannot work in this kind of situation. To get a better understanding of this, I have to debug deeper into the late season, e.g., label all the leaves in the images to see whether the algorithm finds the proper leaves or not.

Leaf curvature

I implemented Gaussian curvature and mean curvature. While I was doing so, I found that the function I used to find the principal curvatures last week is for k_2, not k_1 (the largest eigenvalue is at the end instead of the start), i.e., the smallest one. So I fixed that, and here is the result of the Gaussian curvature of the leaf from the previous post (also with the axis scale issue fixed):

And the comparison among 4 leaves of Gaussian, mean, and k_1 curvature:
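As a side note on the k_1/k_2 mix-up: it is easy to make if the eigenvalues come back in ascending order, as numpy's eigvalsh does. A small sketch of the fixed convention, assuming a 2x2 shape operator in an orthonormal tangent basis (not the exact function from my pipeline):

```python
import numpy as np

def principal_curvatures(shape_operator):
    """Principal curvatures from a symmetric 2x2 shape operator.
    np.linalg.eigvalsh returns eigenvalues in *ascending* order, so unpack
    them so that k1 is always the largest and k2 the smallest."""
    k2, k1 = np.linalg.eigvalsh(shape_operator)   # ascending: smallest first
    gaussian = k1 * k2
    mean = 0.5 * (k1 + k2)
    return k1, k2, gaussian, mean
```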

Reverse phenotyping:

I implemented the post-training stage of reverse phenotyping as a module, since I need to deal with much larger data (a whole month, even a whole season) than before (1-3 days).  Then I can easily run it with trained vectors and hand-labeled data.

Since I noticed that we have a trained (N-pair loss) result for 1 month of scans (never processed for reverse phenotyping), I decided to use that to see what we could get from it (the fastest way to get a result from a larger dataset). Here is some information about the one-month data:

  • total data points: 676025
  • hand labeled:
    • canopy_cover: 2927
    • canopy_height: 30741
    • leaf_desiccation_present: 26477
    • leaf_length: 1853
    • leaf_stomatal_conductance: 54
    • leaf_width: 1853
    • panicle_height: 145
    • plant_basal_tiller_number: 839
    • stalk_diameter_major_axis: 222
    • stalk_diameter_minor_axis: 222
    • stand_count: 7183
    • stem_elongated_internodes_number: 4656
    • surface_temperature_leaf: 1268

A simple t-SNE, downsampled to 100,000 points:

It seems to be forming some lines, but I don't think it is converged enough.

The live t-SNE result is on the way!

...some of them just look dynamic. Very convincingly so. They even have documentation on a "backend framework," etc.

I learned that this week. Twice over. The first frontend template, which I chose for its appearance and flexibility (in terms of frontend components), has zero documentation. Zero. So I threw that out the window, because I need some sort of backend foundation to start with.

After another long search, I finally found this template. Not only is it beautiful (and open source!) but it also has a fair amount of documentation on how to get it up and running and how to build in a "backend framework."  The demo website even has features that appear to be dynamic. 4 hours and 5 AWS EC2 instances later, after I tried repeatedly to (in a containerized environment!) re-route the dev version of the website hosted locally to my EC2's public DNS, I finally figured out it isn't. Long story short, the dev part is dynamic---you run a local instance and the site updates automatically when you make changes---but the production process is not. You compile all the Javascript/Typescript into HTML/CSS/etc and upload your static site to a server.

Now, after more searching, the template I'm using is this one, a hackathon starter template that includes both a configurable backend and a nice-looking (though less fancy) frontend. I've been able to install it on an EC2 instance and get it routed to the EC2's DNS, so it's definitely a step in the right direction.

My laundry list of development tasks for next week includes configuring the template backend to my liking (read: RESTful communication with the Flask server I built earlier) and building a functional button on the page where a user can enter a URL. Also, on a completely different note, writing an abstract about my project for GW Research Days, which I am contractually obligated to do.
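A minimal sketch of the Flask side of that URL button, with a hypothetical endpoint name and payload; the real route will depend on how the template's frontend is wired up.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/submit-url", methods=["POST"])   # hypothetical endpoint name
def submit_url():
    """Receive the URL a user typed into the frontend form and hand it
    off to the processing pipeline (placeholder for the real logic)."""
    url = request.get_json(force=True).get("url", "")
    # ... kick off whatever processing the server does with the URL ...
    return jsonify({"received": url})

if __name__ == "__main__":
    app.run(port=5000)
```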

From my previous posts, we have come to a point where we can simulate the glitter pieces reflecting the light in a conic region as opposed to reflecting the light as a ray, and I think it is more realistic that the glitter reflects the light in a conic region. This means that when optimizing for the light and camera locations simultaneously, we can actually get locations different from what we pre-determined to be the actual locations of the light and camera. Now, we want to take this knowledge and move back to the ellipse problem...

Before I could get back to looking at ellipses using our existing equations and assumptions, I wanted to first test a theory about the foci of concentric ellipses. I generated two ellipses such that a, b, c, d,  and e were the same for both, but the value of f was different. Then, I chose some points on each of the ellipses and tried to use my method of solving for the ellipse to re-generate the ellipse, which worked as it had in the past.

I then went to pen & paper and actually used geometry to find the foci of the inner ellipse:

I found the two foci to be at about (-13, 7.5) and (11, -7.5). Now, using these foci, I calculated the surface normals for each of the points I had chosen on the two ellipses (so pretend the foci are a light and camera). In doing so, I actually found that the calculated surface normals for some of the points are far different from the surface normals I got using the tangent to the curve at each point:

The red lines indicate the tangent to the curve at the point, while the green vector indicates the surface normal of the point if the light and camera were located at the foci (indicated by the orange circles).

Similarly, I calculated and found the foci for the larger ellipse to be at (-15.5, 9) and (13.5, -9), and then calculated what the surface normals of all the points would be with these foci:

Again, the red lines indicate the tangents and the green lines indicate the calculated surface normals.

While I was talking to Abby this morning, she mentioned confocal ellipses, and it made me realize that there may be a difference between concentric and confocal ellipses. Namely, I think that confocal ellipses don't actually share the same values of a, b, c, d, e... maybe concentric ellipses share these coefficients with each other. And I think that is where we have been misunderstanding this problem all along. Now I just have to figure out what the right way to view the coefficients is... :)
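A quick numerical check of that suspicion, for the simple axis-aligned case of the conic ax^2 + bxy + cy^2 + dx + ey + f = 0 with b = d = e = 0: keeping a through e fixed and changing only f changes the focal distance, so the two ellipses are concentric but not confocal. The coefficient values below are made up just to illustrate.

```python
import numpy as np

def focal_distance_axis_aligned(a, c, f):
    """For a centered, axis-aligned conic a*x^2 + c*y^2 + f = 0 (a, c > 0 > f),
    rewrite it as x^2/A^2 + y^2/B^2 = 1 with A^2 = -f/a, B^2 = -f/c and return
    the center-to-focus distance sqrt(|A^2 - B^2|)."""
    A2, B2 = -f / a, -f / c
    return np.sqrt(abs(A2 - B2))

# Two 'concentric' ellipses sharing a..e, differing only in f:
print(focal_distance_axis_aligned(1.0, 4.0, -16.0))  # ~3.46
print(focal_distance_axis_aligned(1.0, 4.0, -36.0))  # ~5.20
# Different focal distances: same a..e with different f gives concentric,
# not confocal, ellipses, which matches the suspicion above.
```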