Keywords: Vision Language Model, Test-Time Scaling, Reasoning, Medical Imaging and Diagnosis
TL;DR: We propose a two-stage zero-shot framework with test-time scaling that enhances LLM-based reasoning for medical image diagnosis, improving both accuracy and reliability across multiple modalities.
Abstract: As a cornerstone of modern healthcare, artificial intelligence is expected to support diverse medical tasks, with large language models (LLMs) offering a promising path to enhanced capabilities. Yet the proficiency of LLMs in text-based tasks has not translated into widespread use for reasoning-based diagnosis in medical imaging. This gap is exacerbated by the impracticality of supervised fine-tuning for clinical reasoning tasks, owing to limited data availability and high annotation costs. In this work, we introduce a fine-tuning-free framework for medical image diagnosis that enhances reasoning through test-time scaling (TTS). Our approach operates in two stages: given either visual or textual inputs, candidate representations or reasoning steps are generated and then aggregated through a self-consistency decoding strategy to yield robust final predictions. This framework avoids the need for expensive supervision while leveraging additional inference-time computation to improve reliability. We provide an analytical justification, deriving scaling laws that characterize when and how TTS yields reliable gains, together with a comprehensive empirical evaluation across medical benchmarks spanning textual and visual modalities. Results demonstrate consistent improvements over single-pass inference baselines, with performance gains of up to 30 percentage points, highlighting the potential of TTS as a practical pathway toward trustworthy medical reasoning without specialized reward models or domain-specific fine-tuning.
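The self-consistency aggregation described in the abstract can be sketched as majority voting over sampled candidate answers. The snippet below is a minimal illustration, not the authors' implementation: `sample_fn` is a hypothetical stand-in for one temperature-sampled decoding pass of the underlying LLM/VLM, and the diagnosis labels are made up.

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n_samples=8):
    """Aggregate multiple stochastic model outputs by majority vote.

    sample_fn: hypothetical callable returning one candidate answer
    (e.g. a diagnosis label) per call; stands in for a single
    temperature-sampled decoding pass of an LLM or VLM.
    Returns the most frequent answer and its agreement rate,
    which can serve as a crude reliability signal.
    """
    candidates = [sample_fn(prompt) for _ in range(n_samples)]
    answer, count = Counter(candidates).most_common(1)[0]
    return answer, count / n_samples

# Toy usage with a deterministic stand-in for the sampled model:
samples = iter(["pneumonia", "normal", "pneumonia", "pneumonia",
                "effusion", "pneumonia", "normal", "pneumonia"])
answer, agreement = self_consistency(lambda p: next(samples), "chest X-ray findings")
print(answer, agreement)  # pneumonia 0.625
```

Spending more inference-time compute here means increasing `n_samples`: each extra sample sharpens the vote, which is the TTS knob the paper's scaling laws characterize.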
Submission Number: 127