Embedding Distance as a Reward Signal Can Replace Verifiers for LLM Reasoning
Track: tiny / short paper (up to 4 pages)
Keywords: LLM, Reasoning, reward model, Fine-tuning
TL;DR: We propose a reward modelling approach based on distances in an embedding space, and show that it achieves performance comparable to ground-truth rewards for LLM reasoning.
Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for adapting Large Language Models (LLMs), offering advantages over Supervised Fine-Tuning (SFT), including reduced catastrophic forgetting and improved generalization. However, these benefits require explicit reward signals, often obtained from human preferences or verifiable outcomes, which are unavailable in many cases. We address this gap by introducing a framework that derives reward functions directly from supervised data, enabling RL-based training without additional annotation.
Our approach formulates reward functions as a weighted distance between embeddings of labels and generated answers.
Experiments on fine-tuning LLMs for a reasoning task demonstrate that our learned rewards match the performance of an oracle RL baseline with access to ground-truth rewards.
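For concreteness, here is a minimal sketch of how such an embedding-distance reward might be computed. The encoder choice (`sentence-transformers`), the model name, the weight vector `w`, and the function name are all illustrative assumptions, not the authors' implementation:

```python
# Sketch of an embedding-distance reward: the reward is the negative
# weighted distance between the embedding of the ground-truth label and
# the embedding of the generated answer (higher reward = closer match).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

def embedding_reward(label: str, answer: str, w=None) -> float:
    """Return -||w * (e_label - e_answer)||_2; the weights w are optional."""
    e_label, e_answer = encoder.encode([label, answer])
    diff = e_label - e_answer
    if w is not None:
        diff = w * diff  # elementwise weighting of embedding dimensions
    return -float(np.linalg.norm(diff))

# Usage: score a generated answer against the supervised label.
print(embedding_reward("The answer is 42.", "The answer is 42."))  # near 0
print(embedding_reward("The answer is 42.", "I don't know."))      # more negative
```

Such a scalar score can then stand in for a verifier or ground-truth reward inside a standard RL fine-tuning loop.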
Presenter: ~Abdelhakim_Benechehab1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 93