
\section{Prompts} \label{sec:prompt}

For evaluation on the MMLU benchmark, we use two prompt formats: direct answering and one-stage Chain-of-Thought (CoT). For medical image diagnosis on the MedMNIST dataset, we use three prompt formats: direct answering, one-stage CoT, and our proposed two-stage reasoning.

\subsection{Prompt Structure for MMLU Evaluation}
\subsubsection{Direct Answering Prompt} \label{sec:direct_prompt}

In the direct answering prompt, the model is instructed to select the correct letter choice without providing any intermediate explanation or reasoning. This setup evaluates the model’s immediate knowledge of the subject matter. An example direct answering prompt is shown below.

\vspace{1em}

\begin{userbox}
The following are multiple-choice questions (with answers) about \{subject\}. Provide your answer with ``The answer is (X)'' where X is the correct letter choice, with no additional explanation.\\[1ex]
\textbf{Question:} \{question\}\\[1ex]
\textbf{Options:} A. \{o1\}, B. \{o2\}, C. \{o3\}, D. \{o4\}
\end{userbox}
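As a minimal sketch of how the template above could be instantiated, the following Python helper fills the placeholders for one question. The names \texttt{DIRECT\_TEMPLATE} and \texttt{build\_direct\_prompt} are hypothetical, and rendering the line breaks as plain newlines without the bold markers is an assumption rather than a description of our evaluation code.

```python
# Hypothetical helper that fills the direct-answering template shown above.
# Plain newlines stand in for the visual line breaks (an assumption).
DIRECT_TEMPLATE = (
    "The following are multiple-choice questions (with answers) about {subject}. "
    'Provide your answer with "The answer is (X)" where X is the correct letter '
    "choice, with no additional explanation.\n"
    "Question: {question}\n"
    "Options: A. {o1}, B. {o2}, C. {o3}, D. {o4}"
)


def build_direct_prompt(subject, question, options):
    """Return the direct-answering prompt for one four-option question."""
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    o1, o2, o3, o4 = options
    return DIRECT_TEMPLATE.format(
        subject=subject, question=question, o1=o1, o2=o2, o3=o3, o4=o4
    )
```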


\subsubsection{Chain-of-Thought (CoT) Prompt}
\label{sec:cot_prompt}

The one-stage Chain-of-Thought (CoT) prompt instructs the model to produce step-by-step reasoning before selecting a final answer. We implement this by explicitly asking the model to ``think step by step'' and then report the final letter choice. This setting assesses the model’s reasoning ability in addition to its factual knowledge.

\vspace{1em}

\begin{userbox}
The following are multiple-choice questions (with answers) about \{subject\}. Think step by step and then finish your answer with ``The answer is (X)'' where X is the correct letter choice.\\[1ex]
\textbf{Question:} \{question\}\\[1ex]
\textbf{Options:} A. \{o1\}, B. \{o2\}, C. \{o3\}, D. \{o4\}
\end{userbox}

\begin{assistantbox}
\textbf{Answer:} Let’s think step by step. 
\end{assistantbox}
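Both MMLU prompts ask for the final answer in the form ``The answer is (X)''. One way to recover the letter from a response is sketched below. The function name \texttt{extract\_choice} is hypothetical, and keeping the last match (so that a letter mentioned during the reasoning does not override the final answer) is an assumption, not a description of our parsing code.

```python
import re

# Matches the requested answer format, e.g. "The answer is (C)".
ANSWER_PATTERN = re.compile(r"[Tt]he answer is \(([A-D])\)")


def extract_choice(response):
    """Return the last letter reported as 'The answer is (X)', or None."""
    matches = ANSWER_PATTERN.findall(response)
    # Keep the final occurrence; earlier ones may belong to the reasoning.
    return matches[-1] if matches else None
```

Responses that never produce the requested format return \texttt{None} and can be counted as incorrect.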


\subsection{One-Stage Prompt Structure for MedMNIST}
In medical imaging tasks with one-stage inference (i.e., direct answering and one-stage CoT), we use a direct-instruction format: the model receives a single-turn system prompt that specifies the classification task and the required output format. For one-stage CoT, we append the cue ``Let's think step by step.'' to the prompt.

\subsubsection{Pneumonia Detection}
\begin{userbox}
Your task is binary-class classification of `pneumonia' against `normal'. Given a gray-scale pediatric chest X-ray image, classify it as 0 (normal) or 1 (pneumonia). Make sure to put the answer (and only answer) inside \texttt{\textbackslash boxed\{\}}.
\end{userbox}

\subsubsection{Colorectal Cancer}
\begin{userbox}
Your task is binary-class classification of `malignant: colorectal adenocarcinoma epithelium' against `normal'. Given a hematoxylin \& eosin stained histological image, classify it as 0 (normal) or 1 (malignant). Make sure to put the answer (and only answer) inside \texttt{\textbackslash boxed\{\}}.
\end{userbox}

\subsubsection{Diabetic Retinopathy}
\begin{userbox}
Your task is binary-class classification of `diabetic retinopathy (DR)' against `normal'. Given a retina fundus image, classify it as 0 (normal) or 1 (DR). Make sure to put the answer (and only answer) inside \texttt{\textbackslash boxed\{\}}.
\end{userbox}
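The three one-stage prompts above all request a binary label inside \texttt{\textbackslash boxed\{\}}. A minimal sketch of extracting that label from a response follows. The name \texttt{extract\_boxed\_label} is hypothetical, and both the tolerance for whitespace inside the braces and the use of the last match are assumptions.

```python
import re

# Matches \boxed{0} or \boxed{1}, allowing spaces inside the braces.
BOXED_PATTERN = re.compile(r"\\boxed\{\s*([01])\s*\}")


def extract_boxed_label(response):
    """Return the last boxed binary label as an int, or None if absent."""
    matches = BOXED_PATTERN.findall(response)
    return int(matches[-1]) if matches else None
```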

\subsection{Two-Stage Prompt Structure for MedMNIST}
In the two-stage reasoning setup for medical image diagnosis, the prompt is structured into two phases: (1) a general instruction asking the model to summarize the visual features from a given image, and (2) a task-specific prompt that provides detailed guidelines and questions, combined with the summary generated in the first stage (referred to as the note).

\subsubsection{Stage 1 Prompt for All Tasks}
\begin{userbox}
Summarize the list of key observable features detected in the image using bullet points.
\end{userbox}

\subsubsection{Stage 2 Prompt for Pneumonia Detection}
\begin{userbox}
You are a healthcare professional to provide accurate pneumonia diagnosis. \\
Task: \\
- You will receive a report describing a patient's pediatric chest X-Ray image. \\
- Your goal is to classify: \\
  - 0 = normal \\
  - 1 = pneumonia \\
\\
Guidelines: \\
1. Carefully read the note. \\
2. Decide which class (0 or 1) best matches the clinical features described. Assume that all of the relevant details have been explained in the text. \\
3. Provide your final answer enclosed in \texttt{\textbackslash boxed\{\}} with no additional explanation, e.g., \texttt{\textbackslash boxed\{1\}}. \\
\\
IMPORTANT: \\
- Strictly adhere to the format by outputting only the final grade inside \texttt{\textbackslash boxed\{\}} and nothing else. \\

\textbf{Note:}\\
\{note\}\\
\\
---\\
\\
\textbf{Question:}\\
Based on the above note, what is the correct pneumonia diagnosis?
Please consider that all necessary details have been provided in the text above.
Remember to provide only the class (0 or 1) inside \texttt{\textbackslash boxed\{\}}.
\end{userbox}

\subsubsection{Stage 2 Prompt for Colorectal Cancer}
\begin{userbox}
You are a pathologist to provide an accurate colorectal adenocarcinoma epithelium diagnosis. \\
Task: \\
- You will receive a report describing a patient's hematoxylin \& eosin stained histological image. \\
- Your goal is to classify the tissue type: \\
  - 0 = normal \\
  - 1 = malignant (colorectal adenocarcinoma epithelium) \\
\\
Guidelines: \\
1. Carefully read the note. \\
2. Decide which class (0 or 1) best matches the clinical features described. Assume that all of the relevant details have been explained in the text. \\
3. Provide your final answer enclosed in \texttt{\textbackslash boxed\{\}} with no additional explanation, e.g., \texttt{\textbackslash boxed\{1\}}. \\
\\
IMPORTANT: \\
- Strictly adhere to the format by outputting only the final grade inside \texttt{\textbackslash boxed\{\}} and nothing else. \\
\textbf{Note:}\\
\{note\}\\
\\
---\\
\\
\textbf{Question:}\\
Based on the above note, what is the correct tissue type?
Please consider that all necessary details have been provided in the text above.
Remember to provide only the class (0 or 1) inside \texttt{\textbackslash boxed\{\}}.
\end{userbox}
\clearpage

\subsubsection{Stage 2 Prompt for Diabetic Retinopathy}
\begin{userbox}
You are an ophthalmologist to provide accurate diabetic retinopathy (DR) diagnosis. \\
Task: \\
- You will receive a report describing a patient's retina fundus image. \\
- Your goal is to classify: \\
  - 0 = normal \\
  - 1 = referrable \\
\\
Guidelines: \\
1. Carefully read the note. \\
2. Decide which class (0 or 1) best matches the clinical features described. Assume that all of the relevant details have been explained in the text. \\
3. Provide your final answer enclosed in \texttt{\textbackslash boxed\{\}} with no additional explanation, e.g., \texttt{\textbackslash boxed\{1\}}. \\
\\
IMPORTANT: \\
- Strictly adhere to the format by outputting only the final grade inside \texttt{\textbackslash boxed\{\}} and nothing else. \\

\textbf{Note:}\\
\{note\}\\
\\
---\\
\\
\textbf{Question:}\\
Based on the above note, what is the correct diabetic retinopathy (DR) diagnosis?
Please consider that all necessary details have been provided in the text above.
Remember to provide only the class (0 or 1) inside \texttt{\textbackslash boxed\{\}}.
\end{userbox}
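The two-stage flow described above can be sketched as follows. The callable \texttt{generate} is a hypothetical wrapper around the model, not an API from any specific library. Passing the image only in stage 1, so that stage 2 sees the note alone, is an assumption suggested by the stage-2 instructions rather than a confirmed implementation detail.

```python
STAGE1_PROMPT = (
    "Summarize the list of key observable features detected in the image "
    "using bullet points."
)


def two_stage_predict(image, stage2_template, generate):
    """Run stage 1 (image -> note), then stage 2 (note -> model response).

    `generate(prompt, image=None)` is a hypothetical callable that returns
    the model's text output; `stage2_template` contains the literal
    placeholder "{note}".
    """
    # Stage 1: task-agnostic description of the image.
    note = generate(STAGE1_PROMPT, image=image)
    # Stage 2: insert the note into the task-specific prompt. str.replace is
    # used instead of str.format because the template also contains the
    # literal braces of \boxed{}.
    stage2_prompt = stage2_template.replace("{note}", note)
    return generate(stage2_prompt)
```

The stage-2 response can then be parsed for the boxed class label.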


\clearpage
\section{Additional Experiments} \label{sec:add_exp}

\setcounter{table}{0}
\renewcommand{\thetable}{C.\arabic{table}}
\setcounter{figure}{0}
\renewcommand{\thefigure}{C.\arabic{figure}}


\begin{table*}[ht]
\centering
\resizebox{0.75\textwidth}{!}{
{\fontsize{12pt}{15pt}\selectfont
\begin{tabular}{l|cccc}
\toprule
\makecell[{{l}}]{~ \\ \textbf{Method}} & $N=1$ & $N=4$ & $N=16$ & $N=64$ \\
\midrule
\midrule
\multicolumn{5}{l}{\em Llama-3.1-8B-Instruct} \\
\midrule
Direct Answering & 58.0$\pm$4.7 & 60.3$\pm$0.8 & 61.7$\pm$0.5 & \textbf{61.9$\pm$0.3} \\
One-stage CoT & 62.7$\pm$1.5 & 69.9$\pm$0.9 & 74.6$\pm$0.6 & \textbf{75.6$\pm$0.3} \\
\midrule
\midrule
\multicolumn{5}{l}{\em Llama-3.2-3B-Instruct} \\
\midrule
Direct Answering & 47.7$\pm$4.9 & 50.1$\pm$0.8 & 51.2$\pm$0.6 & \textbf{51.9$\pm$0.3} \\
One-stage CoT & 37.9$\pm$2.3 & 42.9$\pm$1.0 & 49.6$\pm$0.9 & \textbf{52.9$\pm$0.5} \\
\midrule
\midrule
\multicolumn{5}{l}{\em Llama-3.2-1B-Instruct} \\
\midrule
Direct Answering & 29.7$\pm$3.4 & 31.3$\pm$1.1 & 33.4$\pm$0.8 & \textbf{34.8$\pm$0.5} \\
One-stage CoT & \textbf{16.1$\pm$1.3} & 12.6$\pm$1.0 & 6.4$\pm$0.5 & 3.1$\pm$0.3 \\
\midrule
\midrule
\multicolumn{5}{l}{\em DeepSeek-R1-Distill-Llama-8B} \\
\midrule
Direct Answering & 37.3$\pm$5.0 & 40.1$\pm$0.9 & 42.3$\pm$0.6 & \textbf{42.7$\pm$0.4} \\
One-stage CoT & 50.2$\pm$1.1 & 55.1$\pm$0.9 & 57.6$\pm$0.7 & \textbf{58.2$\pm$0.4} \\
\bottomrule
\end{tabular}
}
}
\vspace{1em}
\caption{MedQA accuracy (\%) with self-consistency TTS for $N \in \{1, 4, 16, 64\}$, reported as mean $\pm$ standard deviation across multiple runs. Bold indicates the best performance for each model-prompt combination. Accuracy improves with increasing $N$ in every setting except \textsc{Llama-3.2-1B-Instruct} with one-stage CoT, where it falls from 16.1\% at $N=1$ to 3.1\% at $N=64$: scaling amplifies degraded reasoning.}
\label{tab:medqa}
\end{table*}
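For reference, the aggregation step of self-consistency is a majority vote over the answers parsed from $N$ sampled responses. The sketch below is a generic illustration. The name \texttt{self\_consistency\_vote} is hypothetical, and discarding unparseable samples before voting is an assumption, not necessarily the rule used to produce the numbers above.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent parsed answer among N samples, or None.

    `answers` holds one parsed answer per sampled response; None marks a
    response from which no answer could be extracted.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns [(answer, count)] for the modal answer.
    return Counter(valid).most_common(1)[0][0]
```

Because the vote returns the modal answer, it helps only when the correct answer is the most frequent one among the samples.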


\begin{table*}[ht]
\centering
\resizebox{0.75\textwidth}{!}{
{\fontsize{12pt}{15pt}\selectfont
\begin{tabular}{l|ll}
\toprule
\makecell[{{l}}]{~ \\ \textbf{Method}} & \makecell[{{l}}]{\textbf{SLAKE}} & \makecell[{{l}}]{\textbf{RAD-VQA}} \\
\midrule
\midrule
\multicolumn{3}{l}{\em LLaVA-Med-v1.5-Mistral-7B} \\
\midrule
Direct Answering & 0.52 & 0.59 \\
Direct Answering (\scaling) & 0.53\gain{0.9pp} & 0.58\loss{0.8pp} \\
\midrule
One-stage CoT & 0.54 & 0.54 \\
One-stage CoT (\scaling) & 0.55\gain{1.7pp} & \textbf{0.59}\gain{5.6pp} \\
\midrule
Two-stage Reasoning & 0.65 & 0.57 \\
Two-stage Reasoning (\scaling) & \textbf{0.66}\gain{2.0pp} & \textbf{0.59}\gain{2.0pp} \\
\bottomrule
\end{tabular}
}
}
\vspace{1em}
\caption{Results on SLAKE and RAD-VQA using LLaVA-Med-v1.5-Mistral-7B with $N=16$. Two-stage reasoning with TTS achieves the best score on SLAKE and ties for the best on RAD-VQA. TTS improves every inference strategy on both datasets except direct answering on RAD-VQA, which drops by 0.8pp.}
\label{tab:llava-med}
\end{table*}



\begin{figure*}[p]
    \centering
    % --- Row 1 ---
    \subfigure[Direct with Llama-1B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_zero_meta-llama_Llama-3.2-1B-Instruct.pdf}
    }\hspace{1em}
    \subfigure[One-stage CoT with Llama-1B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_cot_meta-llama_Llama-3.2-1B-Instruct.pdf}
    }\hspace{1em}
    \subfigure[Direct with Llama-3B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_zero_meta-llama_Llama-3.2-3B-Instruct.pdf}
    }
    \vspace{1em}
    
    \subfigure[One-stage CoT with Llama-3B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_cot_meta-llama_Llama-3.2-3B-Instruct.pdf}
    }\hspace{1em}
    \subfigure[Direct with Llama-8B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_zero_meta-llama_Llama-3.1-8B-Instruct.pdf}
    }\hspace{1em}
    \subfigure[One-stage CoT with Llama-8B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_cot_meta-llama_Llama-3.1-8B-Instruct.pdf}
    }
    \vspace{1em}
    
    \subfigure[Direct with DeepSeek-8B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_zero_deepseek-ai_DeepSeek-R1-Distill-Llama-8B.pdf}
    }\hspace{1em}
    \subfigure[One-stage CoT with DeepSeek-8B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_cot_deepseek-ai_DeepSeek-R1-Distill-Llama-8B.pdf}
    }

    \vspace{2em}
    
    \subfigure[Direct with Llama-11B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_zero.pdf}
    }\hspace{1em}
    \subfigure[One-stage CoT with Llama-11B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_zero_cot.pdf}
    }\hspace{1em}
    \subfigure[Two-Stage with Llama-11B]{
        \includegraphics[width=0.28\linewidth]{figure/ablation_two.pdf}
    }
    \caption{Effect of sample size $N$ in the TTS setting. (a)--(h): text dataset; (i)--(k): vision dataset. Increasing the sample size boosts performance across different datasets and inference methods.}
    \label{fig:scaling_app}
\end{figure*}