GRACE: Generative Representation Learning via Contrastive Policy Optimization

Jiashuo Sun; Shixuan Liu; Zhaochen Su; Xianrui Zhong; Pengcheng Jiang; Bowen Jin; Peiran Li; Weijia Shi; Jiawei Han

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 | ICLR 2026 Poster | CC BY 4.0

Keywords: Large Language Models, Text Representation, Reinforcement Learning

TL;DR: GRACE reimagines contrastive learning as reward-guided generative reasoning, turning LLMs into interpretable embedders that generate explicit rationale traces. It boosts MTEB performance by up to 11.5% while preserving general capabilities.

Abstract: Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black-box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy $\pi_\theta$ that produces explicit, human-interpretable rationales: structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves the overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent decision traces.
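
As a rough illustration of the two mechanisms the abstract names (mean pooling of rationale token representations, and a reward that favors query-positive similarity over query-negative similarity), the following minimal sketch may help. It is not the paper's implementation: the abstract does not specify the reward's components, so the form used here (positive cosine similarity minus mean negative cosine similarity) is an assumption, as are the function names `mean_pool`, `cosine`, and `contrastive_reward` and the toy two-dimensional vectors.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def contrastive_reward(query, positive, negatives):
    """Hypothetical reward (an assumption, not the paper's formula):
    similarity to the positive minus mean similarity to the negatives."""
    pos_sim = cosine(query, positive)
    if negatives:
        neg_sim = sum(cosine(query, neg) for neg in negatives) / len(negatives)
    else:
        neg_sim = 0.0
    return pos_sim - neg_sim


# Toy stand-ins for the token representations of generated rationales.
query = mean_pool([[1.0, 0.0], [1.0, 0.0]])
positive = mean_pool([[2.0, 0.0]])
negative = mean_pool([[0.0, 1.0], [0.0, 3.0]])

reward = contrastive_reward(query, positive, [negative])
```

In a policy-gradient setup, a scalar like `reward` would be used to weight the log-probability of the sampled rationale. That step is omitted here because the abstract does not describe the optimizer's details.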

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 22739
