Dimensionality reduction is a powerful tool for machine learning practitioners to visualize and understand large, high-dimensional datasets. One of the most widely used techniques for visualization is t-SNE, but its performance suffers on large datasets, and using it correctly can be challenging.
UMAP is a new technique by McInnes et al. that offers a number of advantages over t-SNE, most notably increased speed and better preservation of the data's global structure. In this article, we'll take a look at the theory behind UMAP in order to better understand how the algorithm works, how to use it effectively, and how its performance compares with t-SNE.
yarn
yarn dev

yarn dev:cech
yarn dev:hyperparameters
yarn dev:mammoth-umap
yarn dev:mammoth-tsne
yarn dev:supplement
yarn dev:toy
yarn dev:toy_comparison
For the mammoth figures, the raw 3D data was downsampled to 50,000 points before being projected with UMAP / t-SNE. These 50,000 points were then randomly subsampled to 10,000 points in order to minimize the payload size.
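The random subsampling step can be sketched as follows. This is a minimal illustration, not the repository's preprocessing script: the helper name `subsample_indices` and the fixed seed are assumptions. The point it shows is that one set of indices is drawn once and reused, so each original 3D point stays paired with its projected 2D point.

```python
import random


def subsample_indices(n_points, k, seed=0):
    """Pick k distinct indices out of n_points, in ascending order.

    Hypothetical helper: the name and the fixed seed are illustrative only.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_points), k))


# Stand-in data: 50,000 3D points and their 2D projections (random here;
# in practice the 2D points would come from UMAP or t-SNE).
rng = random.Random(1)
points_3d = [(rng.random(), rng.random(), rng.random()) for _ in range(50_000)]
points_2d = [(rng.random(), rng.random()) for _ in range(50_000)]

# Draw the indices once, then apply them to both arrays so that row i of the
# 3D subset still corresponds to row i of the 2D subset.
keep = subsample_indices(len(points_3d), 10_000)
subset_3d = [points_3d[i] for i in keep]
subset_2d = [points_2d[i] for i in keep]
```

Subsampling after projection, rather than before, means the embedding is still computed from all 50,000 points; only the payload shipped to the browser shrinks.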
Understanding UMAP uses a few tricks to make the data payloads for some of the interactive figures small enough to download in a reasonable time.
The mammoth figures use a 10-bit encoding scheme to compress the 10,000 data points into a significantly smaller payload.
The toy_comparison figures precompute UMAP embeddings for all of their different combinations, then use the same 10-bit encoding scheme to compress the data.
yarn preprocess:hyperparameters
yarn preprocess:mammoth
yarn preprocess:toy_comparison
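To give a sense of what a 10-bit encoding scheme involves, here is a minimal sketch. It is one plausible way to do it, not the repository's implementation: the min/max normalization, the big-endian bit order, and the function names (`quantize`, `dequantize`, `pack`, `unpack`) are all assumptions. Each coordinate is mapped to an integer in the range 0 to 1023, and the integers are packed back to back into bytes.

```python
import random

BITS = 10
LEVELS = (1 << BITS) - 1  # 1023, the largest 10-bit value


def quantize(values, lo, hi):
    """Map floats in [lo, hi] to integers in [0, 1023]."""
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant data
    return [round((v - lo) / span * LEVELS) for v in values]


def dequantize(codes, lo, hi):
    """Approximate inverse of quantize."""
    span = (hi - lo) or 1.0
    return [lo + c / LEVELS * span for c in codes]


def pack(codes):
    """Pack 10-bit integers into bytes, most significant bit first."""
    buffer, nbits, out = 0, 0, bytearray()
    for c in codes:
        buffer = (buffer << BITS) | c
        nbits += BITS
        while nbits >= 8:
            nbits -= 8
            out.append((buffer >> nbits) & 0xFF)
        buffer &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((buffer << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack(data, count):
    """Read `count` 10-bit integers back out of packed bytes."""
    buffer, nbits, codes = 0, 0, []
    for byte in data:
        buffer = (buffer << 8) | byte
        nbits += 8
        while nbits >= BITS and len(codes) < count:
            nbits -= BITS
            codes.append((buffer >> nbits) & LEVELS)
        buffer &= (1 << nbits) - 1
    return codes


# Stand-in data: 10,000 2D points flattened to 20,000 coordinates.
rng = random.Random(0)
coords = [rng.uniform(-5.0, 5.0) for _ in range(20_000)]
lo, hi = min(coords), max(coords)

payload = pack(quantize(coords, lo, hi))
restored = dequantize(unpack(payload, len(coords)), lo, hi)
```

In this sketch, 20,000 coordinates at 10 bits each pack into 25,000 bytes, compared with 80,000 bytes as 32-bit floats. The cost is precision: each restored coordinate can differ from the original by up to half a quantization step, which is (hi - lo) / 2046 here. The decoder also needs `lo`, `hi`, and the point count, so those would have to travel with the payload.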