On the Study of the Shape of Language: A Topological Analysis of Universal Text Encoder Embedding Spaces

NeurIPS 2025 Workshop NeurReps Submission 82 Authors

29 Aug 2025 (modified: 29 Oct 2025) · Submitted to NeurReps 2025 · CC BY 4.0
Keywords: text encoders, persistent homology, persistence diagram, persistence landscape, Wasserstein distance, l_2 norm
TL;DR: Topological data analysis of the common representation spaces generated by text encoders; a topological study of the Strong Platonic Representation Hypothesis.
Abstract: Text encoders have become fundamental to Natural Language Processing and other machine learning domains. However, the incompatibility of the embedding spaces they generate poses a significant challenge to their productive use in real-world applications. Methodologies such as \textit{vec2vec} are therefore important in this context, as they raise the question of whether a universal embedding representation space can be generated. In this article, persistent homology is employed to study the common representation spaces generated by \textit{vec2vec} from various text encoders, combining models with identical and different backbones across multiple datasets. Our methodology uses topological data analysis tools, namely persistence diagrams and persistence landscapes, together with distance metrics such as the Wasserstein distance and the $l_2$ norm, to objectively evaluate the similarity of the generated representations. Our findings indicate that while the translated space produced by \textit{vec2vec} is effective in terms of cosine similarity, the translation process introduces topological artifacts. Moreover, no statistically significant correlation was found between geometric and topological metrics, and no statistical evidence was observed to suggest that a shared backbone leads to more robust preservation of topological features. Finally, we determined that \textit{vec2vec} generates a common representation space with similar topological characteristics across different encoders, although the degree of similarity depends on the specific dataset used for training.
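The comparison pipeline described in the abstract can be illustrated with a short sketch. The following is a minimal example, not the authors' implementation: it assumes the gudhi library (with POT installed for Wasserstein distances), uses random arrays as stand-ins for the source and vec2vec-translated embeddings, and compares the two spaces via the Wasserstein distance between persistence diagrams and the $l_2$ norm between vectorized persistence landscapes.

```python
# Minimal sketch of a persistent-homology comparison of two embedding spaces.
# Assumptions: gudhi is installed (POT required for gudhi.wasserstein);
# emb_src / emb_tra are placeholders for real and vec2vec-translated embeddings.
import numpy as np
import gudhi
from gudhi.wasserstein import wasserstein_distance
from gudhi.representations import Landscape

rng = np.random.default_rng(0)
emb_src = rng.normal(size=(100, 16))  # stand-in for source-encoder embeddings
emb_tra = rng.normal(size=(100, 16))  # stand-in for translated embeddings

def h1_diagram(points):
    """Degree-1 persistence diagram from a Vietoris-Rips filtration."""
    rips = gudhi.RipsComplex(points=points)
    st = rips.create_simplex_tree(max_dimension=2)
    st.persistence()  # compute persistence pairs for the whole filtration
    return st.persistence_intervals_in_dimension(1)

dgm_src = h1_diagram(emb_src)
dgm_tra = h1_diagram(emb_tra)

# Topological similarity: 1-Wasserstein distance between H1 diagrams.
w = wasserstein_distance(dgm_src, dgm_tra, order=1.0)

# Vectorize diagrams as persistence landscapes; compare with the l2 norm.
land = Landscape(num_landscapes=3, resolution=100)
vecs = land.fit_transform([dgm_src, dgm_tra])
l2 = np.linalg.norm(vecs[0] - vecs[1])

print(f"Wasserstein(H1): {w:.4f}   l2(landscapes): {l2:.4f}")
```

In a study like the one described, these two scores would be computed per encoder pair and per dataset, so that topological similarity can be contrasted with geometric measures such as cosine similarity.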
Submission Number: 82