\section{Related Work}
\label{sec:related}

\subsection{Vision-Language Models in Medical Imaging}

Conventional data-driven deep learning approaches train a parameterized model on a dataset of image--label pairs. Such models are often treated as \emph{black-box} predictors: they provide a final classification result but lack transparency, and the reasoning behind the diagnosis is not interpretable. Given the safety-critical nature of medicine and the risk that generated content may deviate from clinical standards, rigorous evaluation is required to assess progress and mitigate harms \citep{johnson2023assessing}. Addressing these interpretability and reliability concerns is paramount for the responsible deployment of AI in clinical settings.

Meanwhile, VLMs have progressed rapidly, enabling models to process both images and text for diverse applications, including robotics \citep{li2023behavior}, autonomous driving \citep{qian2024nuscenes}, and scientific research \citep{xu2023chartbench, roberts2024scifibench}. In the medical domain, early efforts focused on foundational tasks such as medical image captioning and VQA on datasets like VQA-RAD \citep{lau2018dataset}. More recently, models like Med-PaLM \citep{tu2024medpalm}, Med-Flamingo \citep{moor2023medflamingo}, and LLaVA-Med \citep{li2023llava} have demonstrated strong performance in generating clinically relevant text. These efforts are being pushed further by recent works such as VILA-M3 \citep{nath2025vila} and MedXpertQA \citep{zuo2025medxpertqa}, which focus on more complex reasoning and more comprehensive evaluation.

Multi-stage reasoning, a paradigm that breaks a complex task into a series of intermediate steps, has gained significant traction as an alternative to end-to-end approaches. It aligns with the clinical workflow, in which a clinician first observes an image and other patient information, analyzes the symptoms, and then formulates a diagnosis based on these observations and prior knowledge. For instance, recent works have explicitly incorporated multi-stage reasoning, such as CoT \citep{liu2024medcot, tu2025towards}, to generate detailed diagnostic rationales and explain the decision-making process. More advanced methods have also emerged, including Tree of Thoughts (ToT) \citep{yao2023tree}, which organizes candidate reasoning paths into a tree. In a diagnostic setting, this allows the model to explore and evaluate multiple hypotheses, together with their supporting evidence, before reaching a final conclusion.

\subsection{Test-Time Compute Scaling}

TTS has become a prominent research area, offering a computationally efficient alternative to retraining for enhancing model performance. By spending a larger computational budget at inference time, these strategies improve a model's accuracy and robustness without requiring any changes to its parameters or architecture. CoT \citep{wei2022chain, temsah2024openai} is a notable example, in which a model is prompted to generate a series of intermediate reasoning steps before arriving at the final answer. While effective, CoT can be sensitive to prompting and may not always yield consistent results. Sampling-based TTS methods go further by moving beyond a single output: they sample multiple candidate outputs and aggregate them to form a more robust and reliable final prediction.

A variety of TTS strategies have been explored, ranging from simple aggregation to more complex reasoning-based methods. Simple approaches like self-consistency \citep{wang2022self} and majority voting rely on aggregating multiple generated outputs to improve reliability. More advanced techniques have significantly pushed performance boundaries on complex benchmarks. For instance, self-refinement \citep{qu2024recursive,madaan2023self} is an iterative approach where a model critiques its own output and then revises it in a feedback loop. Similarly, verifier-based methods \citep{cobbe2021training, uesato2022solving} and process reward models \citep{want2024shepherd} have achieved state-of-the-art results by training a separate model to select the best output. Recent works have validated these approaches on increasingly challenging benchmarks, such as MATH \citep{hendrycks2021measuring}, GSM8K \citep{cobbe2021training}, and the BIG-Bench Hard suite \citep{srivastava2023beyond}, demonstrating their strong performance in mathematical and symbolic reasoning tasks.
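
For concreteness, the aggregation step behind self-consistency and majority voting can be sketched as follows. The notation is purely illustrative and is an assumption of this sketch rather than something taken from the cited works: $x$ denotes the input, $p_\theta$ the model, $N$ the number of samples, $y_i$ the $i$-th sampled output, $a_i$ the final answer extracted from $y_i$, and $\mathcal{A}$ the set of candidate answers.
\[
  y_i \sim p_\theta(\,\cdot \mid x), \quad i = 1, \dots, N,
  \qquad
  \hat{a} = \arg\max_{a \in \mathcal{A}} \sum_{i=1}^{N} \mathbf{1}[a_i = a].
\]
Here $\mathbf{1}[\cdot]$ is the indicator function, so $\hat{a}$ is simply the most frequent answer among the $N$ samples; no additional trained model is involved in this rule.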

Although these TTS techniques are promising, their application in medicine is still emerging. One direction for improving model performance is scaling ``\emph{deep thinking}'' by increasing a model's computational budget for a single reasoning path, such as by expanding its token limit \citep{huang2025m1}. However, this approach faces a significant challenge: it can lead to overthinking \citep{yang2025towards}.
Consequently, ``\emph{parallel thinking}'' strategies represent another important yet unexplored avenue in the medical domain. A key barrier to these advanced methods (e.g., Best-of-$N$ and beam search \citep{snell2024scaling}) is their reliance on reward models, which are often unavailable in medicine because they demand vast amounts of labeled data for training.\footnote{This data consists not merely of correct answers but of expert-annotated process supervision, in which a model's step-by-step reasoning is evaluated. The high cost of clinical experts' time and the inherent complexity of medical judgment make this type of data scarce and prohibitively expensive to acquire.} To this end, this paper explores the application of a reward-free TTS approach to medical image diagnosis by extending a majority voting strategy into a probabilistic framework that improves reliability.
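
To make the dependence on a reward model explicit, Best-of-$N$ selection can be sketched in illustrative notation. The symbols are assumptions of this sketch, not notation from the cited works: $x$ is the input, $y_1, \dots, y_N$ are $N$ candidate outputs sampled from the model, and $r(x, y)$ is a learned reward model that scores a candidate.
\[
  y^{\star} = \arg\max_{i \in \{1, \dots, N\}} r(x, y_i).
\]
The selection is only as good as $r$, which must be trained on labeled data. A voting rule, by contrast, depends only on how often each answer appears among the candidates and requires no such scorer.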