Embedding Distance as a Reward Signal Can Replace Verifiers for LLM Reasoning
Track: tiny / short paper (up to 4 pages)
Keywords: LLM, Reasoning, reward model, Fine-tuning
TL;DR: We propose a reward modelling approach based on distances in an embedding space, and show that it achieves performance comparable to ground-truth rewards for LLM reasoning.
Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for adapting Large Language Models (LLMs), offering advantages over Supervised Fine-Tuning (SFT), including reduced catastrophic forgetting and improved generalization. However, these benefits require explicit reward signals, often obtained from human preferences or verifiable outcomes, which are unavailable in many cases. We address this gap by introducing a framework that derives reward functions directly from supervised data, enabling RL-based training without additional annotation.
Our approach formulates reward functions as a weighted distance between embeddings of labels and generated answers.
Experiments on fine-tuning LLMs for a reasoning task demonstrate that our learned rewards match the performance of an oracle RL baseline with access to ground-truth rewards.
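For concreteness, here is a minimal sketch of how such an embedding-distance reward might be computed. The encoder choice (`sentence-transformers`), the model name, the weight vector `w`, and the function name are all illustrative assumptions, not the authors' implementation:

```python
# Sketch of an embedding-distance reward: the reward is the negative
# weighted distance between the embedding of the ground-truth label and
# the embedding of the generated answer (higher reward = closer match).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

def embedding_reward(label: str, answer: str, w=None) -> float:
    """Return -||w * (e_label - e_answer)||_2; the weights w are optional."""
    e_label, e_answer = encoder.encode([label, answer])
    diff = e_label - e_answer
    if w is not None:
        diff = w * diff  # elementwise weighting of embedding dimensions
    return -float(np.linalg.norm(diff))

# Usage: score a generated answer against the supervised label.
print(embedding_reward("The answer is 42.", "The answer is 42."))  # near 0
print(embedding_reward("The answer is 42.", "I don't know."))      # more negative
```

Such a scalar score can then stand in for a verifier or ground-truth reward inside a standard RL fine-tuning loop.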
Presenter: ~Abdelhakim_Benechehab1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 93