Geometry-Preserving Dimensionality Reduction for Text Embeddings

Published: 02 Jun 2026, Last Modified: 02 Jun 2026 | Greeks in AI 2026 | CC BY 4.0
Keywords: Dimensionality Reduction, Text Embeddings, Geometry Preservation, Post-hoc Compression, Embedding Evaluation
Domains: Language and Learning
TL;DR: A simple distance-preserving linear projection compresses text embeddings by 4× while retaining nearly all of their downstream utility.
Abstract: Dense text embeddings are a core representation in modern NLP, supporting tasks such as retrieval, clustering, classification, and semantic search. However, embeddings often have hundreds or thousands of dimensions, which creates substantial storage and efficiency challenges at scale. In this work, we present a systematic study of post-hoc dimensionality reduction methods for text embeddings across multiple modern embedding backbones, compression ratios, and downstream tasks. We introduce GeoPres, a simple geometry-preserving reduction method: a learned linear map trained to preserve the pairwise distances of the original embedding space, motivated by the Johnson-Lindenstrauss lemma from metric geometry. Our experiments show that embedding dimensionality can often be reduced substantially with minimal loss in downstream task performance, and that GeoPres outperforms competing methods across many settings. We further find that preserving internal similarity rankings correlates strongly with downstream utility, which makes it a useful proxy for evaluating reduction quality. Overall, our results offer practical recommendations for selecting dimensionality reduction techniques for text embedding models.
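The abstract does not specify the GeoPres objective, optimizer, or distance metric, so the following is only a minimal sketch of the general idea: a linear map fitted to preserve pairwise distances, scored with a rank-preservation proxy. Everything in it is an assumption made for illustration: squared Euclidean distances, a squared-error "stress" loss, a Gaussian (Johnson-Lindenstrauss style) initialization, plain gradient descent with step halving, Spearman correlation as the ranking proxy, synthetic data in place of real embeddings, and all function names (`jl_init`, `stress_loss`, `fit_linear_map`, `rank_preservation`).

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    sq = np.sum(X * X, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D, 0.0)  # clip tiny negatives caused by rounding


def jl_init(d, k, seed=0):
    """Gaussian random projection, scaled so distances are kept in expectation."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(d, k)) / np.sqrt(k)


def stress_loss(X, W):
    """Mean squared gap between reduced and original squared distances."""
    E = pairwise_sq_dists(X @ W) - pairwise_sq_dists(X)
    return float(np.mean(E ** 2))


def fit_linear_map(X, W0, steps=200, lr=1e-2):
    """Gradient descent on the stress loss; a step is kept only if it helps."""
    n = X.shape[0]
    D_X = pairwise_sq_dists(X)
    W = W0.copy()
    Z = X @ W
    E = pairwise_sq_dists(Z) - D_X
    loss = np.mean(E ** 2)
    for _ in range(steps):
        # Gradient of the loss w.r.t. Z has Laplacian form: (diag(E 1) - E) Z.
        lap = np.diag(E.sum(axis=1)) - E
        grad = (8.0 / n ** 2) * (X.T @ (lap @ Z))
        W_new = W - lr * grad
        Z_new = X @ W_new
        E_new = pairwise_sq_dists(Z_new) - D_X
        new_loss = np.mean(E_new ** 2)
        if new_loss < loss:   # accept the step and grow the step size a little
            W, Z, E, loss = W_new, Z_new, E_new, new_loss
            lr *= 1.1
        else:                 # reject the step and halve the step size
            lr *= 0.5
    return W


def rank_preservation(X, Z):
    """Spearman correlation between original and reduced pairwise distances."""
    iu = np.triu_indices(X.shape[0], k=1)
    a = pairwise_sq_dists(X)[iu]
    b = pairwise_sq_dists(Z)[iu]
    ranks_a = np.argsort(np.argsort(a))  # ranks; ties are not averaged
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Synthetic stand-in for real embeddings: 200 unit vectors, 64 -> 16 dims (4x).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)

W0 = jl_init(d=64, k=16, seed=1)
W = fit_linear_map(X, W0)

print("stress, random projection:", stress_loss(X, W0))
print("stress, fitted map:       ", stress_loss(X, W))
print("rank preservation, random:", rank_preservation(X, X @ W0))
print("rank preservation, fitted:", rank_preservation(X, X @ W))
```

Because the fitted map is a single `d × k` matrix, applying it to new embeddings is one matrix multiplication (`X_new @ W`), which is what makes this kind of reduction usable post hoc without retraining the embedding model. The rank-preservation score can be computed from the embeddings alone, so under the abstract's finding it could serve as a cheap check before running downstream evaluations.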
Submission Number: 210