Keywords: RL, scaling, reasoning, inference-time scaling, uncertainty
Abstract: Large language models (LMs) are typically post-trained via reinforcement learning (RL) to produce a single best answer per query, implicitly optimizing for modal correctness. While effective for benchmark accuracy, this approach is suboptimal for many applications of interest, such as medical diagnosis, which would benefit from models generating a set of plausible answers (ideally paired with uncertainty estimates). This paper describes a multi-answer RL approach that enables LMs to do this: we modify the RL objective to train models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. We instantiate this approach through Multi-Answer Reinforcement Learning with Verifiable Rewards (Multi-RLVR), which generalizes ordinary RLVR to the multi-answer case with a set-level reward. We further extend this approach to Multi-Answer Reinforcement Learning with Calibrated Rewards (Multi-RLCR), which adds a set-level Brier score-based calibration objective so that LMs output a calibrated uncertainty estimate for each answer in the output set. Multi-answer training promotes explicit representation of alternative hypotheses rather than repeated generation of the dominant mode. Across question-answering and medical diagnostic benchmarks, we observe improved diversity, recall, and set-level calibration compared to sampling multiple times from single-answer-trained baselines. We further observe that models trained with our approach are more token-efficient, requiring fewer tokens to generate multiple answers than competing approaches. These results position multi-answer RL as a performant and efficient alternative to inference-time scaling.
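To make the set-level objective concrete, here is a minimal sketch of what a combined reward might look like: a Multi-RLVR-style correctness term (did the gold answer appear anywhere in the generated set?) plus a Multi-RLCR-style Brier calibration penalty on the per-answer confidences. The function name, signature, weighting, and exact functional form are illustrative assumptions, not the paper's actual implementation.

```python
def multi_answer_reward(candidates, gold, brier_weight=0.5):
    """Hypothetical set-level reward combining set correctness with calibration.

    candidates: list of (answer, confidence) pairs produced in one forward pass.
    gold: the verifiable reference answer.
    All names and the weighting scheme here are assumptions for illustration.
    """
    answers = [answer for answer, _ in candidates]
    # Set-level correctness (Multi-RLVR-style): reward 1 if the gold answer
    # appears anywhere in the candidate set, 0 otherwise.
    correctness = float(gold in answers)
    # Brier score (Multi-RLCR-style): mean squared error between each stated
    # confidence and the 0/1 indicator of that candidate matching the gold
    # answer; lower is better, so it enters the reward as a penalty.
    brier = sum(
        (conf - float(answer == gold)) ** 2 for answer, conf in candidates
    ) / len(candidates)
    return correctness - brier_weight * brier
```

Under this toy formulation, a set containing the gold answer with well-calibrated confidences scores near 1, while a confidently wrong set is penalized below 0.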
Submission Number: 98