m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models
Keywords: Medical, Reasoning, Large Language Models, Test-Time Scaling, Health Care
TL;DR: A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong medical reasoning in large language models.
Abstract: Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present \textbf{m1}, a simple yet effective approach that increases a model’s medical reasoning capability at inference. Through extensive experiments on open-source LLMs (Qwen2.5, 7B and 32B), we demonstrate that increasing the ``thinking'' token budget consistently improves accuracy without additional model training. Our evaluation across diverse medical tasks shows that test-time scaling significantly enhances medical reasoning, enabling lightweight fine-tuned models to achieve performance comparable to computationally intensive counterparts (e.g., our 32B model matches previous 70B-scale medical LLMs). We identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which controls test-time computation by extending reasoning through iterative prompts (e.g., appending "Wait"), helps models double-check answers but does not necessarily improve overall medical QA performance and, in some cases, introduces errors into previously correct responses. Critically, our analysis highlights insufficient medical knowledge as a primary failure mode, a limitation that cannot be resolved through increased reasoning alone, underscoring the necessity of incorporating medical knowledge. Furthermore, increasing data scale, enhancing data quality, or expanding model capacity consistently improves medical knowledge grounding and thus boosts performance, particularly on challenging medical benchmarks where smaller models reach performance saturation.
These findings reveal fundamental differences between medical and mathematical reasoning capabilities in LLMs. All data, code, and models will be publicly available to encourage future exploration in optimizing inference strategies in clinical AI applications.
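The budget-forcing strategy described in the abstract (extending reasoning by appending "Wait" until a thinking-token budget is reached) can be sketched as a simple decoding loop. This is an illustrative sketch, not the paper's actual implementation: `generate` is a hypothetical stand-in for an LLM decoding call, and the budget accounting is simplified to fixed-size rounds.

```python
# Hedged sketch of budget forcing: keep extending the model's chain of
# thought with "Wait" until a ~4K thinking-token budget is exhausted.

def generate(prompt: str, max_new_tokens: int) -> str:
    # Hypothetical placeholder: a real implementation would invoke the
    # LLM here and append up to `max_new_tokens` of decoded text.
    return prompt + " ...reasoning... Final answer: B"

def budget_forced_generate(question: str,
                           token_budget: int = 4096,
                           tokens_per_round: int = 1024,
                           max_rounds: int = 4) -> str:
    """Iteratively extend reasoning, forcing a double-check each round
    by appending 'Wait' (per the abstract's description)."""
    text = question
    used = 0
    rounds = 0
    while used < token_budget and rounds < max_rounds:
        text = generate(text, max_new_tokens=tokens_per_round)
        used += tokens_per_round
        rounds += 1
        if used < token_budget and rounds < max_rounds:
            # Suppress the end of generation and prompt continued thinking.
            text += "\nWait"
    return text
```

In practice the loop would track actual decoded token counts rather than fixed round sizes; the abstract notes that this forcing helps models re-examine answers but can also flip previously correct responses.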
Submission Number: 165