Abstract: Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI's o-series models highlight test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models into open-source models and focus primarily on multiple-choice tasks.
This strategy, though straightforward, neglects critical concerns like data privacy and realistic deployment in clinical settings.
In this work, we present a small-scale medical reasoning system, \mone, designed for long-chain reasoning in clinical tasks using a self-evolution paradigm.
Starting from 8,000 instances sampled with a curriculum strategy spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS), constructing rule-verifiable reasoning chains over two iterations.
Each reasoning step is scored by rollout estimation, which provides supervision for training both the policy model and a soft dual-sided process reward model (PRM).
Experiments on eleven evaluation datasets demonstrate that \mone outperforms not only the prior strongest medical model by 6.45 points, but also 32B-level general reasoning models by 8.57 points.
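To make the rollout-based step scoring concrete, the sketch below illustrates one plausible reading of it: a step's value is the fraction of rollouts from that reasoning prefix whose final answer passes rule-based verification, and these per-step scores become PRM training targets. This is an illustrative assumption, not the paper's implementation; `sample_completion`, `is_correct`, and the value definition are hypothetical stand-ins.

```python
import random


def sample_completion(question: str, prefix_steps: list[str]) -> str:
    """Hypothetical policy-model rollout: complete the reasoning from a prefix
    and return a final answer. A real system would call the policy model."""
    return random.choice(["A", "B"])


def is_correct(answer: str, gold: str) -> bool:
    """Rule-based verification of a rolled-out final answer."""
    return answer == gold


def rollout_value(question: str, prefix_steps: list[str], gold: str,
                  n_rollouts: int = 8) -> float:
    """Estimate a step's value as the success rate of rollouts from its prefix."""
    hits = sum(
        is_correct(sample_completion(question, prefix_steps), gold)
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts


def score_chain(question: str, steps: list[str], gold: str) -> list[float]:
    """Assign a rollout-estimated score to every step of a reasoning chain,
    yielding (step, score) pairs usable as process reward model supervision."""
    return [rollout_value(question, steps[: i + 1], gold) for i in range(len(steps))]


if __name__ == "__main__":
    chain = ["Identify the symptom pattern.", "Match it to a diagnosis.", "Answer: A"]
    print(score_chain("Which diagnosis fits this presentation?", chain, gold="A"))
```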
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: biomedical QA, reasoning
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches low compute settings-efficiency, Data resources, Data analysis
Languages Studied: English
Submission Number: 1016