\section{Results and Discussion} \label{sec:experiment}
\subsection{Datasets and Models}
As an initial proof of concept, we first evaluate TTS on text-based medical QA tasks using the Massive Multitask Language Understanding (MMLU) benchmark \citep{hendrycks2020measuring}. We focus on six medically relevant domains: clinical knowledge, medical genetics, anatomy, professional medicine, college biology, and college medicine. Since these are multiple-choice questions, all answer options are included in the prompt along with the question. Detailed prompts and inference settings are provided in Appendix~\ref{sec:prompt}.

To further assess generalizability across modalities and disease types in medical image diagnosis, we use PneumoniaMNIST, PathMNIST, and RetinaMNIST from MedMNIST v2 \citep{yang2023medmnist}. Specifically, pneumonia detection is performed using PneumoniaMNIST, which consists of 390 pneumonia cases and 234 normal cases from frontal X-ray images. 
PathMNIST is utilized for colorectal cancer classification, containing 1,233 cases of colorectal adenocarcinoma epithelium and 741 cases of normal colon mucosa.
Diabetic retinopathy (DR) classification is explored with RetinaMNIST, which includes 226 cases of referrable (i.e., non-proliferative or proliferative DR) and 174 normal cases from fundus images.
All images are standardized to a resolution of $224 \times 224$.

For the MMLU benchmark, we primarily evaluate \textsc{Llama-3.1-8B-Instruct} \citep{touvron2023llama} and \textsc{DeepSeek-R1-Distill-Llama-8B} \citep{deepseekai2025}, and report additional results with \textsc{Llama-3.2-1B-Instruct} and \textsc{Llama-3.2-3B-Instruct}.
For medical image diagnosis, we employ \textsc{Llama-3.2-11B-Vision-Instruct} \citep{touvron2023llama}. Since the second stage of our two-stage inference framework admits flexible model selection, we further experiment with smaller text-only Llama models (1B, 3B, and 8B) as well as the medical-domain-specific \textsc{Med42-v2-8B} model \citep{christophe2024med42}.


\subsection{Comparison with Baselines}

\input{table/table_comparison_mmlu}
\input{table/table_comparison}

We compare TTS effectiveness against conventional baselines across three test-time inference settings: (1) direct answering, (2) one-stage CoT, and (3) the two-stage reasoning framework for medical image diagnosis. Each setting is assessed with and without the proposed TTS strategy.

Table~\ref{tab:comparison_mmlu} and Table~\ref{tab:comparison} show that the TTS strategy consistently delivers substantial performance gains across diverse tasks, models, and test-time inference strategies. While one-stage CoT without TTS often yields marginal or negative gains, TTS shows strong effects on vision-centric tasks through multi-sample aggregation. Our analysis further reveals several key observations:

\begin{itemize}[leftmargin=10pt]
    \item \textbf{Consistent gains across tasks and models.} On the MMLU dataset (Table~\ref{tab:comparison_mmlu}), TTS improves performance across six medical knowledge areas by up to 17.5 percentage points (pp). These trends also hold on the more challenging MedQA benchmark (Table~\ref{tab:medqa}). For medical image diagnosis (Table~\ref{tab:comparison}), TTS consistently boosts AUC and AP scores across modalities, with gains of up to 30.4 pp. This indicates that the advantage of TTS is not task- or model-specific, but generalizes across text- and vision-based medical tasks.
    
    \item \textbf{TTS outperforms CoT.} While prompt engineering alone yields marginal or unstable effects, as seen in the gap between direct answering and one-stage CoT, TTS consistently produces gains regardless of the prompting strategy. This highlights TTS as a more reliable mechanism for enhancing model performance than reformatting instructions.

    \item \textbf{Strong effects on vision tasks.} Our TTS strategy achieves its most pronounced improvements on vision-centric tasks (Table~\ref{tab:comparison}). A single-pass VLM often overlooks subtle visual cues or produces ambiguous descriptions, whereas the multi-sample nature of TTS allows diverse perspectives to be aggregated into a more reliable representation. These improvements arise from the synergy between structured reasoning and test-time scaling, with scaling serving as the key driver of robustness.

\end{itemize}

\subsection{Scaling Laws for Test-Time Compute}

We analyze TTS effectiveness by systematically varying the number of samples from $N=1$ (i.e., single-pass inference) up to $N=64$ for text-based MMLU benchmarks and up to $N=16$ for vision-based diagnosis tasks. As shown in Figure~\ref{fig:scaling}, \textbf{performance improves monotonically with test-time compute}, particularly up to medium sample sizes (e.g., $N \leq 16$): accuracy improves substantially as the model aggregates multiple complementary reasoning processes, reducing reliance on potentially flawed single explanations. More comprehensive results are reported in Figure~\ref{fig:scaling_app} in the Appendix.

\input{figure/fig_tts}


This scaling behavior is consistent with the theoretical analysis in Corollary~\ref{cor:general} (Section~\ref{sec:analysis}) and with prior observations by \citet{beeching2024scaling} on language model reasoning tasks. Importantly, the observed gains in medical image diagnosis suggest that such scaling laws extend beyond mathematical reasoning to multimodal medical applications.

From a practical standpoint, this result underscores a critical lesson: \textbf{relying on a single model output is unreliable in medical tasks}, as LLMs can generate plausible yet misleading information in specialized contexts\footnote{For instance, across all test-time inference methods, single-sample inference occasionally yields AUCs of only 50--60\% on disease classification tasks.}.
In contrast, test-time compute elevates diagnostic performance up to approximately 80\% without additional fine-tuning or retraining. This improvement is also practically viable: inference cost scales linearly with $N$, since the prompt is processed once and reused via KV caching \citep{pope2023efficiently} (an $O(1)$ cost with respect to $N$) and self-consistency aggregation is negligible on CPU, so Figure~\ref{fig:scaling} directly reflects the cost--performance trade-off. Furthermore, parallelizing generation across multiple GPUs can substantially reduce latency. This highlights TTS as a promising avenue for improving both reliability and safety in real-world medical AI deployment.
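
As a rough sketch of this cost model, and assuming that each sampled response has a similar generation length, the total cost of TTS with $N$ samples can be written as
\[
\mathrm{Cost}(N) \;\approx\; c_{\mathrm{prefill}} \;+\; N\,c_{\mathrm{decode}} \;+\; c_{\mathrm{agg}},
\]
where $c_{\mathrm{prefill}}$ (one-time prompt processing), $c_{\mathrm{decode}}$ (generation cost per sample), and $c_{\mathrm{agg}}$ (majority-vote aggregation) are illustrative symbols introduced here, not measured quantities. Since $c_{\mathrm{prefill}}$ is paid once and $c_{\mathrm{agg}}$ is negligible, the $N\,c_{\mathrm{decode}}$ term dominates, which corresponds to the linear scaling in $N$ described above.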

\subsection{Model Capacity Matters for TTS} 
So far, we have observed that TTS consistently improves zero-shot performance and exhibits stronger synergy with models possessing sufficient reasoning ability (e.g., 8B and above). To investigate this trend, we conduct an ablation study on MMLU using smaller models (Table~\ref{tab:comparison_mmlu_small}). While TTS provides modest improvements for direct answering, applying one-stage CoT substantially degrades performance in smaller models. This degradation is further amplified by TTS. For instance, with the 1B model, one-stage CoT cuts accuracy by more than half, and combining it with TTS drives accuracy to near-zero. These results highlight that TTS effectiveness depends critically on baseline model competence. When models exhibit non-trivial accuracy, TTS enhances reasoning; conversely, when models struggle to reason, scaling reinforces biased or uninformative outputs, as we formally show in Proposition~\ref{lem:general} (Section~\ref{sec:analysis}). This underscores that naively introducing reasoning prompts can be counterproductive without sufficient underlying capability.

\input{table/table_comparison_mmlu_small}

\section{When Does TTS Help? An Analytical Justification} \label{sec:analysis}
While self-consistency decoding has demonstrated strong empirical performance in medical applications \citep{singhal2023large, singhal2025toward} and mathematical reasoning tasks \citep{beeching2024scaling}, it remains underexplored whether TTS can be applied across different LLMs and how its scaling behavior unfolds (e.g., whether it converges quickly or grows monotonically). To address this gap, we present a theoretical analysis of TTS based on self-consistency decoding. Proofs are in Appendix~\ref{sec:proof}.

\paragraph{Setup.}
Consider a $C$-class classification problem with true class $c^\star$.
A single decode (vote) from the LM yields label $y\in\{1,\dots,C\}$ with
\begin{equation}
\mathbb{P}(y=c^\star)=p, \qquad \mathbb{P}(y=j)=p_j\ \ (j\neq c^\star),
\end{equation}
where $p+\sum_{j\neq c^\star} p_j = 1$.
We draw $N$ i.i.d.\ votes, let $X_j$ be the number of votes for class $j$, and predict by majority vote
$\MV=\arg\max_j X_j$ (break ties uniformly at random). Define the strongest competitor
$q:=\max_{j\neq c^\star} p_j$.

\begin{prop}[Majority vote vs.\ strongest competitor]\label{lem:general}
If $p>q$, then
\begin{equation}
\mathbb{P}(\MV\neq c^\star) \;\le\; (C-1)\,\exp\!\Big(-\tfrac{N}{2}\,(p-q)^2\Big),
\end{equation}
so the error decays exponentially in $N$, and the bound tightens as the margin $p-q$ grows (i.e., as the LM becomes more confident).

Conversely, if $q>p$, then
\begin{equation}
\mathbb{P}(\MV = c^\star) \;\le\; (C-1)\,\exp\!\Big(-\tfrac{N}{2}\,(q-p)^2\Big),
\end{equation}
so $\MV$ amplifies the wrong class as $N$ grows.
\end{prop}
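
\paragraph{Intuition for the rate.}
One standard route to a bound of this form is Hoeffding's inequality combined with a union bound; the following is a sketch under that assumption and is not necessarily the argument used in Appendix~\ref{sec:proof}. Let $y_1,\dots,y_N$ denote the $N$ votes. For a fixed competitor $j\neq c^\star$, define $Z_i := \mathbf{1}\{y_i=c^\star\}-\mathbf{1}\{y_i=j\}\in[-1,1]$, so that $X_{c^\star}-X_j=\sum_{i=1}^{N} Z_i$ and $\mathbb{E}[Z_i]=p-p_j\ge p-q>0$. Hoeffding's inequality for variables with range $2$ gives
\[
\mathbb{P}\big(X_j \ge X_{c^\star}\big)
\;=\; \mathbb{P}\Big(\sum_{i=1}^{N} Z_i \le 0\Big)
\;\le\; \exp\!\Big(-\tfrac{2N^2(p-p_j)^2}{N\cdot 2^2}\Big)
\;\le\; \exp\!\Big(-\tfrac{N}{2}\,(p-q)^2\Big).
\]
Since $\MV\neq c^\star$ requires $X_j\ge X_{c^\star}$ for at least one of the $C-1$ competitors, a union bound over $j\neq c^\star$ yields the factor $(C-1)$.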

\begin{corollary}[Exponential scaling]\label{cor:general}
If $p>q$, the error of $\MV$ decays exponentially with $N$, and to achieve
$\mathbb{P}(\MV\neq c^\star)\le \delta$ it suffices that
\begin{equation}
N \;\ge\; \frac{2}{(p-q)^2}\,\log\!\Big(\frac{C-1}{\delta}\Big).
\end{equation}
If $q>p$, then $\mathbb{P}(\MV=c^\star)$ decays exponentially in $N$ at the same rate.
\end{corollary}
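
\paragraph{Illustrative sample sizes.}
To give a sense of scale, the bound in Corollary~\ref{cor:general} can be evaluated for hypothetical values of $p$, $q$, and $\delta$ (chosen here for illustration only, not taken from our experiments), with $\log$ denoting the natural logarithm. For a four-option question ($C=4$) with $p=0.5$, $q=0.3$, and $\delta=0.05$,
\[
N \;\ge\; \frac{2}{(0.5-0.3)^2}\,\log\!\Big(\frac{3}{0.05}\Big) \;=\; 50\,\log 60 \;\approx\; 204.7,
\]
so $N=205$ votes suffice. For a binary task ($C=2$) with $p=0.7$, $q=0.3$, and $\delta=0.05$,
\[
N \;\ge\; \frac{2}{(0.7-0.3)^2}\,\log\!\Big(\frac{1}{0.05}\Big) \;=\; 12.5\,\log 20 \;\approx\; 37.4,
\]
so $N=38$ votes suffice. These values are sufficient rather than necessary: the bound is a worst-case guarantee, so substantial gains at smaller $N$ are compatible with it.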

\paragraph{Summary of the theoretical findings.}
Proposition~\ref{lem:general} shows that if $p>q$, the error decays exponentially in $N$, whereas if $q>p$, majority vote amplifies the wrong label.
Hence: (1) self-consistent TTS improves with larger $N$ \emph{in regimes where the true class has the largest single-pass probability}, and (2) it is \emph{effective only when the LLM is sufficiently confident}, in the sense of a positive margin $p-q>0$.
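
\paragraph{A minimal exact example.}
Both regimes can be checked by hand in the simplest case, a binary task ($C=2$, so $q=1-p$) with $N=3$ votes, where no ties occur. The values of $p$ below are hypothetical and chosen only for illustration. Majority vote is correct when at least two of the three votes are correct:
\[
\mathbb{P}(\MV=c^\star) \;=\; p^3 + 3p^2(1-p).
\]
For $p=0.7$ this gives $0.343+0.441=0.784>0.7$, so voting improves on a single decode. For $p=0.4$ it gives $0.064+0.288=0.352<0.4$, so voting is worse than a single decode, matching the amplification of the wrong class when $q>p$.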