Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

Published: 26 Jan 2026, Last Modified: 26 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Text embedding, LLM, representation learning
Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB embedding benchmark. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.
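The core idea of the ICR objective — applying a contrastive loss at every refinement step, rather than only to the final embedding — can be sketched in a heavily simplified form. The code below is an illustrative toy, not the paper's implementation: the function names (`info_nce`, `icr_loss`), the plain InfoNCE formulation, and the temperature value are all assumptions, and the per-step embeddings stand in for representations that would actually be produced by autoregressive soft-token generation.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE loss over a batch: each anchor's positive is the same-index
    row of `positives`; the other rows in the batch act as negatives.
    (Standard contrastive loss; temperature is an illustrative choice.)"""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def icr_loss(step_embeddings, positives):
    """Sketch of Iterative Contrastive Refinement: score the embedding at
    every refinement step, so each generated soft token is pushed to
    improve the representation, not just the final one."""
    return [info_nce(e, positives) for e in step_embeddings]
```

As a sanity check, if the per-step embeddings move progressively closer to their positives (mimicking successful refinement), the per-step losses should shrink across steps; a training objective could then sum or weight these per-step terms.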
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5041