
\subsubsection{Two-Stage Reasoning for Medical Image Diagnosis}
According to recent theoretical and experimental evidence \citep{abbe2025far}, the Transformer architecture often benefits when a complex task is decomposed into sub-tasks. 
Motivated by this observation, we propose and investigate a two-stage approach that helps a Transformer-based model reach a more accurate diagnosis from medical images.


\paragraph{Visual Description Generation.}
We first instruct the VLM to describe the visual features of the input image without directly querying for a diagnosis. % , illustrated in Figure~\ref{fig:prompt}. 
Concretely, we prompt the VLM as follows: 
$v = \mathsf{VLM}(\bx, q_1)$ where $q_1$ can be \texttt{"Describe visual features detected in the image"}.


\paragraph{Diagnosis from Descriptions.}
The generated visual descriptions $v$ are then provided as input to a (potentially different) LLM that produces a final diagnosis. 
For example, we can construct the query: $q_2(v) :=$ \txt{Decide which class best matches the visual features described: 0 (normal) or 1 (pneumonia). **Features:** \{features\}}, where we substitute $\{\texttt{features}\}$ with the previously generated $v$. 
The corresponding diagnosis is characterized as:
\begin{equation}
% a = \mathsf{LLM}\bigl(q_2(v)\bigr) = \mathsf{LLM}\Bigl(q_2\bigl(\mathsf{VLM}(\bx, q_1)\bigr)\Bigr).
a = \mathsf{LLM}\bigl(q_2(v)\bigr) = \underbrace{ \mathsf{LLM}\biggl(\, \underbrace{\mathsf{VLM}(\bx, q_1)}_{\text{Describe}}, q_2\biggr) }_{\text{Diagnose}}.
\end{equation}



\subsection{Scaling Test-Time Compute}

General-purpose language models such as \textsc{Llama} or \textsc{DeepSeek} often struggle to provide accurate answers in complex medical tasks, and fine-tuning is prohibitively expensive due to the scarcity of expert-annotated reasoning data \citep{berger2025reasoning, naliyatthaliyazchayil2025evaluating}. To address this issue, we investigate the applicability of test-time scaling (TTS), a technique often applied in mathematical reasoning tasks \citep{yao2023tree, snell2024scaling}. In particular, we adopt self-consistency decoding \citep{wang2022self}, which aggregates sampled answers by majority vote and requires no reward model, given the absence of reliable reward models in medicine.

\paragraph{One-Stage TTS.} 
We estimate class probabilities by sampling $N$ independent outputs from a large reasoning model (LRM) under randomized decoding. Specifically, instead of greedy decoding, which deterministically selects the highest-probability token at each step, we use temperature sampling with $T=0.7$ \citep{wang2022self}. At each token position, the model's output logits $\mathbf{z} = (z_1, \ldots, z_V)$ are converted to a probability distribution via the temperature-scaled softmax $p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}$, where $V$ is the vocabulary size. The next token is then sampled from this distribution, allowing us to generate $N$ diverse reasoning paths.
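
As a brief illustration of how the temperature reshapes the sampling distribution, consider a hypothetical three-token vocabulary with logits $\mathbf{z}=(2,1,0)$; these values are chosen only for illustration and are not taken from any model in our experiments. With $T=0.7$, the unnormalized weights are $\exp(z_i/T)\approx(17.41,\,4.17,\,1)$, and the resulting distributions are approximately
\[
T=1:\;\; \mathbf{p}\approx(0.665,\,0.245,\,0.090), \qquad
T=0.7:\;\; \mathbf{p}\approx(0.771,\,0.185,\,0.044).
\]
Lowering $T$ below $1$ concentrates probability mass on the highest-logit token while still leaving the alternatives reachable, and the limit $T\to 0$ recovers greedy decoding.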

Let the label space be $\mathcal{Y}=\{1,\dots,C\}$. For each draw $i\in\{1,\dots,N\}$, the model produces an answer string $a^{(i)}$, which we map to a class via a parsing function $\phi:\text{text}\!\to\!\mathcal{Y}$ (e.g., extracting ``A/B/C/D'' or an index in $\{1,\dots,C\}$). Denote the parsed class by $\hat y^{(i)}=\phi\!\bigl(a^{(i)}\bigr)\in\mathcal{Y}$. Formally,

\begin{equation}
\{a^{(i)}\}_{i=1}^N \;\overset{\text{i.i.d.}}{\sim}\; \mathsf{LRM}(\bx,q), \qquad 
\hat y^{(i)}=\phi\!\bigl(a^{(i)}\bigr). \label{eq:one_stage_inference}
\end{equation}

Each $\hat y^{(i)}$ can be viewed as a draw from the model-induced predictive distribution over classes, $p(y\mid \bx,q)$. We estimate these class probabilities by Monte Carlo:

\begin{equation}
\widehat{p}(y=c \mid \bx,q) \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\!\bigl(\hat y^{(i)}=c\bigr).
\end{equation}

The final prediction is the maximum-probability class under this estimate:

\begin{equation}
\hat y \;=\; \arg\max_{c\in\{1,\dots,C\}} \widehat{p}(y=c\mid \bx,q).
\end{equation}
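
To make the estimator concrete, suppose, purely as a hypothetical example, that $C=3$, $N=10$, and the parsed votes split as six for class $1$, three for class $2$, and one for class $3$. Then
\[
\widehat{p}(y \mid \bx,q) = (0.6,\;0.3,\;0.1), \qquad \hat y = 1 .
\]
Under the i.i.d.\ sampling assumption, each entry of $\widehat{p}$ is a sample proportion with standard error $\sqrt{p(1-p)/N}$. Plugging in $\widehat{p}=0.6$ gives
\[
\sqrt{\frac{0.6 \times 0.4}{10}} \approx 0.155,
\]
so at this sample size the estimate of the leading class probability is still coarse, and the error shrinks only at the rate $1/\sqrt{N}$.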

\paragraph{Two-Stage TTS.}
For two-stage inference in medical image diagnosis, we can apply TTS both in the description stage and in the diagnosis stage. Formally,
\begin{align}
\{v^{(i)}\}_{i=1}^N &\;\overset{\text{i.i.d.}}{\sim}\; \mathsf{VLM}(\bx,q_1), \\
\{a^{(i,j)}\}_{j=1}^M &\;\overset{\text{i.i.d.}}{\sim}\; \mathsf{LRM}(v^{(i)},q_2), \label{eq:two_stage_inference}
\end{align}
where $v^{(i)}$ denotes the $i$-th description sampled from the VLM in the first stage, and $a^{(i,j)}$ is the $j$-th diagnosis from the language model given that description in the second stage.

Empirically, we observe that even under randomized decoding, the diagnosis $a^{(i,j)}$ remains unchanged across $j$ for a fixed description $v^{(i)}$. This suggests that the predictive uncertainty originates from the reasoning process (description stage) rather than from the decision-making process (diagnosis stage). Consequently, there is no measurable gain from scaling test-time compute in the second stage, and we therefore set $M=1$. The final class probabilities and prediction are then estimated in the same way as in the one-stage case, using the $N$ parsed diagnoses $\phi\bigl(a^{(i,1)}\bigr)$.
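
One way to make the aggregation explicit for general $M$ is to pool votes uniformly over all sampled diagnoses. This pooling rule is an illustrative assumption, and other aggregation schemes are possible:
\[
\widehat{p}(y=c \mid \bx,q_1,q_2) \;=\; \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\mathbb{I}\!\Bigl(\phi\bigl(a^{(i,j)}\bigr)=c\Bigr).
\]
Setting $M=1$ reduces this to
\[
\widehat{p}(y=c \mid \bx,q_1,q_2) \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\!\Bigl(\phi\bigl(a^{(i,1)}\bigr)=c\Bigr),
\]
which has the same form as the one-stage estimate with $a^{(i)}$ replaced by $a^{(i,1)}$. In terms of cost, the sampling scheme uses $N$ VLM calls and $NM$ LRM calls per image, so $M=1$ requires $N$ calls to each model.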