Partition First, Embed Later: Laplacian-Based Feature Partitioning for Refined Embedding and Visualization of High-Dimensional Data

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 oral · CC BY 4.0
TL;DR: We present a feature partitioning approach for embedding and visualizing multiple low-dimensional structures within high-dimensional data
Abstract: Embedding and visualization techniques are essential for analyzing high-dimensional data, but they often struggle with complex data governed by multiple latent variables, potentially distorting key structural characteristics. This paper considers scenarios where the observed features can be partitioned into mutually exclusive subsets, each capturing a different smooth substructure. In such cases, visualizing the data based on each feature partition can better characterize the underlying processes and structures in the data, leading to improved interpretability. To partition the features, we propose solving an optimization problem that promotes graph Laplacian-based smoothness in each partition, thereby prioritizing partitions with simpler geometric structures. Our approach generalizes traditional embedding and visualization techniques, allowing them to learn multiple embeddings simultaneously. We establish that if several independent or partially dependent manifolds are embedded in distinct feature subsets in high-dimensional space, then our framework can reliably identify the correct subsets with theoretical guarantees. Finally, we demonstrate the effectiveness of our approach in extracting multiple low-dimensional structures and partially independent processes from both simulated and real data.
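As a rough illustration of the abstract's core idea, the sketch below scores a candidate feature partition by graph Laplacian smoothness: for each feature subset, build a kNN graph from those features alone and measure how smoothly the features vary over that graph, so that partitions with simpler geometric structure score lower. This is a minimal sketch, not the paper's actual optimization; the function names, the unnormalized Laplacian, the kNN construction, and the two-circles demo data are all illustrative assumptions.

```python
import numpy as np

def knn_laplacian(X, k=10):
    """Unnormalized Laplacian L = D - W of a symmetrized kNN graph on rows of X.
    (Illustrative choice; any smoothness-inducing graph Laplacian would do.)"""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # exclude self-edges
    W = np.zeros((n, n))
    idx = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbors per point
    for i in range(n):
        W[i, idx[i]] = 1.0
    W = np.maximum(W, W.T)                                # symmetrize the graph
    return np.diag(W.sum(1)) - W

def partition_smoothness(X, parts, k=10):
    """Sum over subsets S of tr(X_S^T L_S X_S), where L_S is the Laplacian of the
    graph built from the features in S. Lower = smoother = simpler geometry."""
    total = 0.0
    for S in parts:
        Xs = X[:, S]
        L = knn_laplacian(Xs, k)
        total += np.trace(Xs.T @ L @ Xs)
    return total

# Demo (hypothetical data): two independent circles occupying disjoint feature pairs,
# mimicking the paper's setting of separate manifolds in distinct feature subsets.
rng = np.random.default_rng(0)
t1, t2 = rng.uniform(0, 2 * np.pi, (2, 300))
X = np.column_stack([np.cos(t1), np.sin(t1), np.cos(t2), np.sin(t2)])

good = partition_smoothness(X, [[0, 1], [2, 3]])  # correct partition: one circle each
bad = partition_smoothness(X, [[0, 2], [1, 3]])   # mixed partition: circles entangled
```

On data like this, the correct partition yields a markedly lower smoothness score than a mixed one, which is the signal the proposed optimization exploits when selecting among candidate partitions.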
Lay Summary: Modern scientific datasets are often large and complex, containing thousands of high-dimensional samples---that is, each sample is described by many different features or attributes. To make sense of such data, researchers use embedding and visualization techniques to reduce its dimensionality, enabling them to identify patterns and structures more easily. But when the data originates from multiple unknown underlying processes---each affecting different sets of features---a single visualization can mix these processes, making the results difficult to interpret. In this paper, we tackle this problem by developing a method that separates the features into groups, with each group reflecting a different underlying process. A separate embedding or visualization for each group provides a clearer view of the underlying processes, helping to disentangle and highlight the different factors at play. We also provide theoretical guarantees for successful recovery when the underlying processes are independent or even partially dependent, and we demonstrate the approach’s effectiveness on both simulated and real scientific datasets.
Link To Code: https://github.com/erezpeter/Feature_Partition.git
Primary Area: General Machine Learning->Unsupervised and Semi-supervised Learning
Keywords: data visualization, dimensionality reduction, manifold learning, data embedding, feature partitioning
Submission Number: 1917