Escaping the Mode: Multi-Answer Reinforcement Learning in LMs
Keywords: RL, scaling, reasoning, inference-time scaling, uncertainty
Abstract: Large language models (LMs) are typically post-trained via reinforcement learning (RL) to produce a single best answer per query, implicitly optimizing for modal correctness. While effective for benchmark accuracy, this approach is misaligned with reliability-critical settings, such as medical diagnosis, where models must reason under uncertainty and avoid overconfident hallucinations.
We propose a multi-answer RL framework that trains LMs to explicitly generate sets of plausible answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. We instantiate this framework through Multi-Answer Reinforcement Learning with Verifiable Rewards (Multi-RLVR), which generalizes standard RLVR to a set-level reward, and further extend it to Multi-Answer Reinforcement Learning with Calibrated Rewards (Multi-RLCR), which incorporates a set-level Brier score objective to produce calibrated uncertainty estimates for each answer.
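As an illustrative sketch of what such set-level objectives could look like (all notation here is ours, not the paper's; in particular the size penalty $\lambda$ and calibration weight $\beta$ are hypothetical hyperparameters): for a query with verifiable gold answer $y^{\star}$, a generated answer set $S = \{a_1, \dots, a_k\}$, and per-answer confidences $p_1, \dots, p_k$, one plausible instantiation is

$$
R_{\text{Multi-RLVR}}(S) = \mathbb{1}[y^{\star} \in S] - \lambda\,|S|,
\qquad
R_{\text{Multi-RLCR}}(S) = R_{\text{Multi-RLVR}}(S) - \beta \sum_{i=1}^{k} \bigl(p_i - \mathbb{1}[a_i = y^{\star}]\bigr)^{2},
$$

where the first term generalizes RLVR's per-answer correctness check to set membership (with $\lambda$ discouraging trivially large sets), and the second is a set-level Brier penalty that rewards calibrated per-answer confidences.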
By encouraging explicit representation of alternative hypotheses rather than repeated generation of a dominant mode, multi-answer training mitigates overconfident single-answer failure modes commonly associated with hallucination. Across question-answering and medical diagnostic benchmarks, we observe improved diversity, recall, and set-level calibration compared to inference-time sampling from single-answer-trained baselines, while requiring fewer total tokens. These results position multi-answer RL as a reliable and efficient alternative to inference-time scaling for uncertainty-aware and agentic decision-making.
Submission Number: 78