Dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP)

Animesh Danayak
4 min readJan 22, 2021

How our practicum team aims to tackle the high dimensionality problem in customer segmentation for the world’s largest self-storage company.

Businesses all over the world shell out millions of dollars each year to segment their customers and tailor decisions to each cohort's needs. They collect a plethora of features on their customers to stay ahead of the competition. This, however, introduces what is known as the curse of dimensionality: having too many input variables. Contrary to the common notion of "the more, the better", high-dimensional data complicates analysis and is often problematic; in particular, models trained on such data tend to overfit. Parsimony in models (and data) is therefore sought, and various techniques have been developed to tackle this issue. Dimensionality reduction is a powerful tool in machine learning for visualizing and understanding large, high-dimensional datasets.

It is not hard to believe that the marketing department of one of the largest self-storage companies in the world (our practicum partner) will be "cursed" with the problem of dimensionality. They have a treasure trove of data available on their customers and are oriented towards using it to make data-driven decisions. The abundance of features, however, introduces complexity to the process. Dimensionality reduction techniques to the rescue. UMAP is a novel technique for dimensionality reduction introduced by McInnes et al. that offers a faster algorithm while preserving the global structure of the data.

In this blog, we briefly touch upon:

  1. the theory behind UMAP in order to better understand how the algorithm works
  2. how to use it effectively
  3. how this will impact our analyses and insights

How does UMAP work?

Figure: 2-D UMAP projection of 10,000 data points in 3-D

Source: 3-D model from the Smithsonian Institution | 2-D projection using the source code available here

UMAP first constructs a graphical representation of the high-dimensional data, using a "fuzzy simplicial complex". It then optimizes a low-dimensional graph to be as structurally similar as possible, preserving the local structure in balance with the global structure while retaining (most of) the data's global shape. This is important because humans comprehend low-dimensional data much better. To dive deeper into the mathematics behind UMAP, readers might find this helpful: Understanding UMAP.

How do we use UMAP effectively?

Although extremely useful for visualizing high-dimensional data, UMAP is by no means a cure-all, and to interpret its results effectively the following takeaways are key:

  1. Projections vary with hyperparameters, and tuning them multiple times gives a better sense of how projections change
  2. The relative size of the clusters in the UMAP plot has no meaning
  3. The relative distance between the clusters might not have any meaning
  4. Random noise does not always "look" random
  5. We may need to generate the plot multiple times since the algorithm is stochastic

How will this impact our analyses?

While interviewing the stakeholders of our practicum project, we identified that its primary goal is to segment customers into cohorts so the Marketing department can drive decisions based on their customers' needs. While customer segmentation is fairly commonplace in the analytics industry, what makes the job hard is communicating the output in a manner that can be easily deciphered. UMAP helps not just with creating the segments but also plays the role of interactive "storytelling" with data.

Conclusion

UMAP is an incredibly powerful tool in a data scientist's arsenal and offers numerous advantages over traditional dimensionality reduction techniques. Dimensionality reduction techniques carry an inherent loss by virtue of the trade-off between complexity and explainability: they distort the data to fit it into lower dimensions, but precisely because humans perceive low-dimensional data better, they are effective for communicating results to a non-technical audience. By building up an intuitive understanding of how the algorithm works and learning how to tune its parameters, we can more effectively use this powerful tool to visualize and understand large, high-dimensional datasets.
