Escaping the Mode: Multi-Answer Reinforcement Learning in LMs
Keywords: RL, scaling, reasoning, inference-time scaling, uncertainty
Abstract: Large language models (LMs) are typically post-trained via reinforcement learning (RL) to produce a single best answer per query, implicitly optimizing for modal correctness. While effective for benchmark accuracy, this approach is misaligned with reliability-critical settings, such as medical diagnosis, where models must reason under uncertainty and avoid overconfident hallucinations.
We propose a multi-answer RL framework that trains LMs to explicitly generate sets of plausible answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. We instantiate this framework through Multi-Answer Reinforcement Learning with Verifiable Rewards (Multi-RLVR), which generalizes standard RLVR to a set-level reward, and further extend it to Multi-Answer Reinforcement Learning with Calibrated Rewards (Multi-RLCR), which incorporates a set-level Brier score objective to produce calibrated uncertainty estimates for each answer.
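As an illustrative sketch of what such set-level objectives could look like (all notation here is ours, not the paper's; in particular the size penalty $\lambda$ and calibration weight $\beta$ are hypothetical hyperparameters): for a query with verifiable gold answer $y^{\star}$, a generated answer set $S = \{a_1, \dots, a_k\}$, and per-answer confidences $p_1, \dots, p_k$, one plausible instantiation is

$$
R_{\text{Multi-RLVR}}(S) = \mathbb{1}[y^{\star} \in S] - \lambda\,|S|,
\qquad
R_{\text{Multi-RLCR}}(S) = R_{\text{Multi-RLVR}}(S) - \beta \sum_{i=1}^{k} \bigl(p_i - \mathbb{1}[a_i = y^{\star}]\bigr)^{2},
$$

where the first term generalizes RLVR's per-answer correctness check to set membership (with $\lambda$ discouraging trivially large sets), and the second is a set-level Brier penalty that rewards calibrated per-answer confidences.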
By encouraging explicit representation of alternative hypotheses rather than repeated generation of a dominant mode, multi-answer training mitigates overconfident single-answer failure modes commonly associated with hallucination. Across question-answering and medical diagnostic benchmarks, we observe improved diversity, recall, and set-level calibration compared to inference-time sampling from single-answer-trained baselines, while requiring fewer total tokens. These results position multi-answer RL as a reliable and efficient alternative to inference-time scaling for uncertainty-aware and agentic decision-making.
Submission Number: 78