Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

ACL ARR 2026 January Submission 2101 Authors

01 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Text Embeddings, Anisotropy, Information Retrieval, Semantic Shift
Abstract: Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
Paper Type: Long
Research Area: Semantics: Lexical, Sentence-level Semantics, Textual Inference, and Other Areas
Research Area Keywords: polysemy, lexical semantic change, compositionality, multi-word expressions, word embeddings, semantic textual similarity
Contribution Types: Theory
Languages Studied: English
Submission Number: 2101
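The abstract describes semantic shift as a computable measure combining local semantic evolution with global semantic dispersion over a text's sentence embeddings. The paper's exact formulation is not given here, so the following is only an illustrative sketch under assumed definitions: local evolution as the mean cosine distance between consecutive sentence embeddings, global dispersion as the mean cosine distance of each sentence embedding to the normalized centroid, combined additively. The function name `semantic_shift` and the additive combination are assumptions, not the authors' implementation.

```python
import numpy as np

def semantic_shift(sentence_embs: np.ndarray) -> float:
    """Illustrative semantic-shift score for one text.

    sentence_embs: (n_sentences, dim) array of sentence embeddings.
    Returns local evolution + global dispersion (both cosine-based);
    this additive combination is an assumption for illustration.
    """
    # Unit-normalize so dot products are cosine similarities.
    E = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)

    # Local semantic evolution: mean cosine distance between
    # consecutive sentences (0 if the text has a single sentence).
    if len(E) > 1:
        local = float(np.mean(1.0 - np.sum(E[:-1] * E[1:], axis=1)))
    else:
        local = 0.0

    # Global semantic dispersion: mean cosine distance of each
    # sentence to the normalized centroid (the pooled vector).
    centroid = E.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    dispersion = float(np.mean(1.0 - E @ centroid))

    return local + dispersion
```

Consistent with the abstract's smoothing argument, a text whose sentences all embed identically scores 0 (the pooled vector coincides with every sentence), while semantically diverse sentences yield a positive score as the centroid drifts away from each constituent.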