Keywords: medical, reasoning, list, answer, reinforcement learning, supervised fine-tuning, prompting, chain-of-thought
TL;DR: We study how medical reasoning models can produce ranked lists for open-ended questions. Comparing prompting, SFT, and RFT, we find RFT the most robust, highlighting ranked lists as a promising alternative to single answers.
Abstract: This paper presents a systematic study on enabling *medical* reasoning
models (MRMs)--which achieve state-of-the-art performance on multiple-choice
benchmarks--to remain robust when producing alternative *answer
formats*. Answer formats define the structure of a final answer in a
generated response, such as an option, free text, or a ranked list.
Although clinical decision-making typically involves weighing multiple
plausible possibilities, current MRMs are trained to produce only one
answer, and their robustness beyond that format is not well studied. To
address this gap, we focus on the *ranked-list* format as an alternative
that better reflects clinical uncertainty, and we evaluate *prompting*
and *fine-tuning* for enabling MRMs to generate ranked lists across common
medical benchmarks. While prompting provides a lightweight solution,
MRMs vary widely in their ability to follow such instructions. We
therefore explore supervised fine-tuning (SFT) and reinforcement
fine-tuning (RFT) as stronger adaptation methods. SFT trains models to
imitate ranked outputs, whereas RFT optimizes behavior through reward
functions; we introduce new rewards tailored to ranked-list generation
(an illustrative sketch follows the abstract) and analyze their effects
through ablations. Our results show that
although some SFT models handle certain formats well, RFT yields more
consistent robustness across multiple answer formats. A case study on a
modified MedQA benchmark with multiple valid answers further reveals
that MRMs can recognize clinically sound alternatives even when
misaligned with a benchmark's preferred ground truth. To the best of our
knowledge, this is the first systematic investigation of adapting MRMs
to alternative answer formats such as ranked lists. We hope this study
lays the foundation for developing more flexible and clinically aligned
MRMs.
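The abstract does not specify the paper's actual reward functions, so the following is only a minimal sketch of what a ranked-list reward for RFT could look like: the format check, the reciprocal-rank term, and the 0.2/0.8 weights are all illustrative assumptions, not the authors' method.

```python
import re

def ranked_list_reward(response: str, gold: str, max_items: int = 5) -> float:
    """Hypothetical reward for ranked-list answers (NOT the paper's actual
    reward): a small bonus for emitting a well-formed numbered list, plus a
    reciprocal-rank term when the gold answer appears in the list."""
    # Extract items from lines like "1. pneumonia" or "2) pulmonary embolism".
    items = re.findall(r"^\s*\d+[.)]\s*(.+?)\s*$", response, flags=re.MULTILINE)
    if not items or len(items) > max_items:
        return 0.0  # malformed or overly long list earns no reward
    reward = 0.2  # format bonus: the model produced a parseable ranked list
    for rank, item in enumerate(items, start=1):
        if item.lower() == gold.strip().lower():
            reward += 0.8 / rank  # 0.8 if ranked first, 0.4 if second, ...
            break
    return reward

# Example: gold answer ranked second -> 0.2 + 0.8/2 = 0.6
print(ranked_list_reward("1. pneumonia\n2. pulmonary embolism",
                         "pulmonary embolism"))
```

A reward of this shape separately credits format compliance and ranking quality, which is one plausible way to realize "rewards tailored to ranked-list generation" without collapsing the list back to a single correct/incorrect signal.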
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 10856