Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models

18 Sept 2025 (modified: 01 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: medical, reasoning, list, answer, reinforcement learning, supervised fine-tuning, prompting, chain-of-thought
TL;DR: We study how medical reasoning models can produce ranked lists for open-ended questions. Comparing prompting, SFT, and RFT, we find RFT the most robust, highlighting ranked lists as a promising alternative to single answers.
Abstract: This paper presents a systematic study on enabling *medical* reasoning models (MRMs)--which achieve SOTA performance on multiple-choice benchmarks--to remain robust when producing answers in alternative *answer formats*. Answer formats define the structure of a final answer in a generated response, such as a multiple-choice option, free text, or a ranked list. Although clinical decision-making typically involves weighing multiple plausible possibilities, current MRMs are trained to produce only one answer, and their robustness beyond that format is not well studied. We focus on the *ranked-list* format as an alternative that better reflects clinical uncertainty. To address this gap, we evaluate *prompting* and *fine-tuning* for enabling MRMs to generate ranked lists across common medical benchmarks. While prompting provides a lightweight solution, MRMs vary widely in their ability to follow such instructions. We therefore explore supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) as stronger adaptation methods. SFT trains models to imitate ranked outputs, whereas RFT optimizes behavior through reward functions; we introduce new rewards tailored to ranked-list generation and analyze their effects through ablations. Our results show that although some SFT models handle certain formats well, RFT yields more consistent robustness across multiple answer formats. A case study on a modified MedQA benchmark with multiple valid answers further reveals that MRMs can recognize clinically sound alternatives even when misaligned with a benchmark's preferred ground truth. To the best of our knowledge, this is the first systematic investigation of adapting MRMs to alternative answer formats such as ranked lists. We hope this study lays the foundation for developing more flexible and clinically aligned MRMs.
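The abstract does not specify the paper's reward functions, so as a rough illustration only, the sketch below shows one plausible shape a ranked-list reward for RFT could take: scoring a generated list by the reciprocal rank of the ground-truth answer, with a small penalty for duplicate entries. The function name, the reciprocal-rank form, and the penalty value are all assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch of a ranked-list reward for RFT. The rewards
# introduced in the paper are not described in this abstract; the
# reciprocal-rank form and duplicate penalty here are assumptions.

def ranked_list_reward(predicted: list[str], gold: str) -> float:
    """Reward a ranked list by the reciprocal rank of the gold answer.

    Returns 0.0 if the gold answer is absent, and applies a small
    penalty when the list contains duplicates (a format error).
    """
    normalized = [p.strip().lower() for p in predicted]
    penalty = 0.1 if len(set(normalized)) < len(normalized) else 0.0
    try:
        rank = normalized.index(gold.strip().lower()) + 1  # 1-based rank
    except ValueError:
        return 0.0  # gold answer not in the list
    return max(1.0 / rank - penalty, 0.0)

# Example: gold answer ranked second, no duplicates -> reward 0.5.
print(ranked_list_reward(["pneumonia", "pulmonary embolism"], "pulmonary embolism"))
```

A rank-sensitive reward of this kind would give partial credit when the correct diagnosis appears lower in the list, which is one way to reconcile ranked-list outputs with single-ground-truth benchmarks.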
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 10856