\section{Introduction}
\label{sec:intro}

Large language models (LLMs) and vision-language models (VLMs) show strong performance across diverse domains, including mathematics \citep{snell2024scaling}, robotics \citep{li2023behavior}, autonomous driving \citep{qian2024nuscenes}, and scientific research \citep{xu2023chartbench, roberts2024scifibench}. A central factor in these advances is their ability to perform \emph{reasoning}. While conventional deep learning models are often treated as black-box predictors, outputting a final result without rationale, recent large reasoning models can articulate intermediate steps explaining how answers are derived.

Explicit multi-step reasoning methods, such as chain-of-thought (CoT) prompting \citep{wei2022chain, temsah2024openai}, produce step-by-step explanations that enhance problem-solving in domains like arithmetic and symbolic reasoning. Early studies have applied CoT-style prompting to medical tasks \citep{liu2024medcot, tu2025towards}, aligning with clinical practice where physicians sequentially observe, interpret, and diagnose. However, the effectiveness of multi-stage reasoning typically depends on fine-tuning with large collections of annotated reasoning processes. In medicine, such annotations require domain experts and are costly to obtain, making supervised approaches impractical. This scarcity motivates approaches that \emph{do not rely on fine-tuning}, which are promising for extending large reasoning models (LRMs) to medical domains without extensive supervision.

Zero-shot prediction with LRMs, however, often yields suboptimal performance. To address this, \emph{test-time scaling} (TTS) has recently emerged as a promising inference paradigm. The key idea is to allocate additional computation during inference to improve a model’s reasoning ability. A common strategy is ``parallel thinking'', where multiple candidate outputs are sampled and aggregated, rather than relying on a single generated output (i.e., single-pass decoding) \citep{yao2023tree,snell2024scaling}. These approaches, ranging from majority voting \citep{wang2022self} to verifier-based selection \citep{cobbe2021training, uesato2022solving, want2024shepherd, lightman2024lets}, have demonstrated strong performance in domains requiring complex reasoning, such as mathematics. 

Directly transferring TTS methods, particularly those leveraging verifiers, to medical applications presents significant challenges. Verifiers, known as reward models, are often unavailable in the medical domain because their training requires vast amounts of labeled reward data. For example, Qwen-PRM \citep{zhang2025lessons}, a reward model used for mathematical reasoning, required 4.5 million labels for its training. Therefore, TTS in medicine has focused on reward-free inference schemes, such as self-consistency decoding \citep{singhal2025toward}, mostly on textual benchmarks.

Despite these efforts, our understanding of TTS in medicine remains limited. Key open questions include: \textbf{can these methods extend beyond text to multimodal medical decision making tasks?} and \textbf{under what conditions does TTS improve performance?} Motivated by these limitations, this work presents a comprehensive investigation of TTS for medical decision-making tasks, introducing a framework that enhances performance without requiring additional supervision or specialized reward models.


Our key contributions are summarized as follows:
\begin{enumerate}[leftmargin=1.25em,itemsep=2pt,topsep=2pt]
    \item We investigate inference strategies (direct answering and CoT) and introduce a two-stage reasoning framework for medical decision making, where a VLM produces textual descriptions that are aggregated by an LLM for diagnosis.
    
    \item We evaluate TTS on test-time inference strategies and show consistent improvements, with gains of up to 30.4 percentage points over single-pass baselines.
    
    \item We present a characterization of TTS, deriving scaling laws that describe how performance improves with the number of samples and identifying conditions under which TTS yields reliable gains.
\end{enumerate}
