
Last week, I found a problem with UMAP: if the high-dimensional graph of an embedding representation is not connected, as with the N-pair result on the CAR training dataset, UMAP's optimizer keeps pushing the clusters away from each other. This doesn't matter for visualization, but in TUMAP we need to measure the loss of each map.
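
A quick way to see whether a dataset hits this issue is to count the connected components of the fitted fuzzy graph. Below is a minimal sketch using umap-learn's `graph_` attribute and scipy; the matrix `X` is just a stand-in for an actual embedding.

```python
import numpy as np
import umap
from scipy.sparse.csgraph import connected_components

X = np.random.randn(2000, 64)  # stand-in for an embedding matrix

reducer = umap.UMAP(n_neighbors=15).fit(X)
# reducer.graph_ is the fitted fuzzy simplicial set (a sparse affinity
# matrix); if it splits into several connected components, the repulsion
# term keeps pushing those components apart during layout optimization.
n_parts, part_labels = connected_components(reducer.graph_, directed=False)
print(f"high-dimensional graph has {n_parts} connected component(s)")
```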

So, we tried a few different ways to avoid or solve this problem.

First, we computed the KL divergence between the normal UMAP result and the TUMAP result, instead of comparing their losses.
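
Concretely, one way to realize that comparison (the choice of neighbor distribution below is an assumption, mirroring t-SNE's low-dimensional kernel) is to build a normalized Student-t affinity matrix Q over each 2-D map and take KL(Q_UMAP || Q_TUMAP):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def low_dim_q(Y):
    """Normalized Student-t affinities of a 2-D embedding Y, shape (n, 2)."""
    d2 = squareform(pdist(Y, "sqeuclidean"))
    q = 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)
    return q / q.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two affinity matrices of the same shape."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Y_umap, Y_tumap: 2-D maps of the same points from the two methods
# kl = kl_divergence(low_dim_q(Y_umap), low_dim_q(Y_tumap))
```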

Second, we tried applying the repulsive gradient only along the edges of the high-dimensional graph, instead of between every pair of points. But the results of this method came out weird.
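
A toy sketch of the idea follows; this is not the actual patch to umap-learn's optimizer, and the a = b = 1 gradient simplification and the (1 - w) repulsion weight are assumptions. Both forces act only along edges of the high-dimensional graph, instead of repelling negatively sampled point pairs.

```python
import numpy as np

def edge_only_step(Y, edges, weights, lr=0.1, eps=1e-4):
    """One layout update on a 2-D map Y using only high-graph edges."""
    for (i, j), w in zip(edges, weights):
        diff = Y[i] - Y[j]
        d2 = float(diff @ diff)
        attract = 2.0 * diff / (1.0 + d2)                # pulls i toward j
        repel = -2.0 * diff / ((eps + d2) * (1.0 + d2))  # pushes i away from j
        grad = w * attract + (1.0 - w) * repel           # edge weight trades them off
        Y[i] -= lr * grad
        Y[j] += lr * grad
    return Y
```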

Third, I tried adding a zero vector to the high-dimensional vectors and making it equally very far from every other point when constructing the high-dimensional graph. It didn't work.
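
For reference, one way to set that up is umap-learn's `metric="precomputed"` mode, appending an extra point whose distance to every real point is a large constant; the data and the "very far" constant below are assumptions.

```python
import numpy as np
import umap
from scipy.spatial.distance import pdist, squareform

X = np.random.randn(500, 64)      # stand-in for the high-dimensional vectors
D = squareform(pdist(X))          # pairwise distances of the real points

far = 10.0 * D.max()              # "equally very far" constant (an assumption)
n = D.shape[0]
D_aug = np.full((n + 1, n + 1), far)
D_aug[:n, :n] = D
np.fill_diagonal(D_aug, 0.0)

Y = umap.UMAP(metric="precomputed").fit_transform(D_aug)
Y = Y[:n]                         # drop the artificial point from the map
```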

UMAP paper: https://arxiv.org/abs/1802.03426

Here are some attempts based on the Python module umap-learn.

First, we tried UMAP on some Gaussian datasets.

(1): We generated two Gaussian datasets (1000×64 each) at different locations (means), and visualized them by a) randomly picking two dimensions, b) UMAP, and c) t-SNE. (A runnable sketch of this setup follows the list.)

(2): We generated two Gaussian datasets (1000×64 each) with different scales (standard deviations), and visualized them the same three ways.

(3): We generated two Gaussian datasets (1000×64 each) with different locations (means) and scales (standard deviations), and visualized them the same three ways.
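
Here is a runnable sketch of setup (1); swapping which of `loc` and `scale` differ between the two clouds reproduces setups (2) and (3). The parameter values are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
import umap
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
A = rng.normal(loc=0.0, scale=1.0, size=(1000, 64))
B = rng.normal(loc=5.0, scale=1.0, size=(1000, 64))  # different mean, same std
X = np.vstack([A, B])
color = np.repeat([0, 1], 1000)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
# a) two randomly picked raw dimensions
i, j = rng.choice(64, size=2, replace=False)
axes[0].scatter(X[:, i], X[:, j], c=color, s=2)
axes[0].set_title("two random dimensions")
# b) UMAP
Y = umap.UMAP().fit_transform(X)
axes[1].scatter(Y[:, 0], Y[:, 1], c=color, s=2)
axes[1].set_title("UMAP")
# c) t-SNE
Z = TSNE(n_components=2).fit_transform(X)
axes[2].scatter(Z[:, 0], Z[:, 1], c=color, s=2)
axes[2].set_title("t-SNE")
plt.show()
```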


Then, we compared the results of t-SNE and UMAP on the embeddings produced by N-pair and EPSHN on the CAR dataset.
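
The comparison itself looks roughly like this sketch; the `.npy` file names are hypothetical stand-ins for wherever the trained model's embedding vectors and class labels were saved.

```python
import numpy as np
import matplotlib.pyplot as plt
import umap
from sklearn.manifold import TSNE

emb = np.load("npair_car_train_emb.npy")        # hypothetical file names
labels = np.load("npair_car_train_labels.npy")

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
Y = umap.UMAP().fit_transform(emb)
axes[0].scatter(Y[:, 0], Y[:, 1], c=labels, s=2, cmap="tab20")
axes[0].set_title("UMAP")
Z = TSNE(n_components=2).fit_transform(emb)
axes[1].scatter(Z[:, 0], Z[:, 1], c=labels, s=2, cmap="tab20")
axes[1].set_title("t-SNE")
plt.show()
```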

N-pair on training data:

N-pair on validation data:

EPSHN on training data:

EPSHN on validation data: