Keywords: Text embedding, LLM, representation learning
Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield a better representation. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the Massive Text Embedding Benchmark (MTEB). Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.
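For intuition, the following is a minimal, hypothetical sketch of the training loop the abstract describes, assuming a model that exposes `embed` (token-embedding lookup), `soft_token` (one autoregressive soft-token step), and `pool` (sequence-to-vector pooling). These names, the step count, and the in-batch InfoNCE form are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def icr_loss(model, anchor_ids, positive_ids, num_steps=4, tau=0.05):
    """Sketch of Iterative Contrastive Refinement (ICR): at each step the
    model appends one generated soft token, the refined sequence is pooled
    into a sentence embedding, and a contrastive loss is applied so that
    every refinement step (not just the last) yields a better representation."""
    losses = []
    a_state = model.embed(anchor_ids)    # (B, L, d) continuous inputs
    p_state = model.embed(positive_ids)
    for _ in range(num_steps):
        # Autoregressively produce one soft token (B, 1, d) and append it.
        a_state = torch.cat([a_state, model.soft_token(a_state)], dim=1)
        p_state = torch.cat([p_state, model.soft_token(p_state)], dim=1)
        # Pool each refined sequence into a normalized sentence embedding.
        a_emb = F.normalize(model.pool(a_state), dim=-1)  # (B, d)
        p_emb = F.normalize(model.pool(p_state), dim=-1)
        # In-batch InfoNCE: matching anchor/positive pairs are the targets.
        logits = a_emb @ p_emb.t() / tau                  # (B, B)
        labels = torch.arange(logits.size(0), device=logits.device)
        losses.append(F.cross_entropy(logits, labels))
    # Average across steps so each refinement is pushed toward improvement.
    return torch.stack(losses).mean()
```

Because the loss averages over refinement steps, each step is trained to produce a usable embedding; running the loop for more steps at inference than were used in training would correspond to the test-time scaling behavior noted above.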
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5041