Abstract: Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI's o-series models highlight test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models into open-source models and focus primarily on multiple-choice tasks.
This strategy, though straightforward, neglects critical concerns like data privacy and realistic deployment in clinical settings.
In this work, we present a small-scale medical reasoning system, \mone, designed for long-chain reasoning in clinical tasks using a self-evolution paradigm.
Starting from 8,000 instances sampled with a curriculum strategy spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS), constructing rule-verifiable reasoning chains over two iterations.
Each reasoning step is scored by rollout estimation, which provides supervision for training both the policy model and a soft dual-sided process reward model (PRM).
Experiments on eleven evaluation datasets demonstrate that \mone outperforms not only the prior strongest medical model by 6.45 points, but also 32B-level general reasoning models by 8.57 points.
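To make the rollout-based step scoring concrete, the sketch below illustrates one plausible reading of it: a step's value is the fraction of rollouts from that reasoning prefix whose final answer passes rule-based verification, and these per-step scores become PRM training targets. This is an illustrative assumption, not the paper's implementation; `sample_completion`, `is_correct`, and the value definition are hypothetical stand-ins.

```python
import random


def sample_completion(question: str, prefix_steps: list[str]) -> str:
    """Hypothetical policy-model rollout: complete the reasoning from a prefix
    and return a final answer. A real system would call the policy model."""
    return random.choice(["A", "B"])


def is_correct(answer: str, gold: str) -> bool:
    """Rule-based verification of a rolled-out final answer."""
    return answer == gold


def rollout_value(question: str, prefix_steps: list[str], gold: str,
                  n_rollouts: int = 8) -> float:
    """Estimate a step's value as the success rate of rollouts from its prefix."""
    hits = sum(
        is_correct(sample_completion(question, prefix_steps), gold)
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts


def score_chain(question: str, steps: list[str], gold: str) -> list[float]:
    """Assign a rollout-estimated score to every step of a reasoning chain,
    yielding (step, score) pairs usable as process reward model supervision."""
    return [rollout_value(question, steps[: i + 1], gold) for i in range(len(steps))]


if __name__ == "__main__":
    chain = ["Identify the symptom pattern.", "Match it to a diagnosis.", "Answer: A"]
    print(score_chain("Which diagnosis fits this presentation?", chain, gold="A"))
```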
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: biomedical QA, reasoning
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches low compute settings-efficiency, Data resources, Data analysis
Languages Studied: English
Submission Number: 1016