

\section{Methods}
\label{sec:proposed}
% \subsection{Problem Setup}
% \label{sec:problem}

This paper presents an approach to modality-agnostic medical decision making with the goal of generating accurate answers to clinically relevant questions based on input data, without requiring task-specific fine-tuning. Formally, a medical decision making problem (including textual question answering (QA) and medical image diagnosis) can be represented as a triplet $(\bx, q, y)$, where: 
\begin{itemize}
    \item $\bx$ denotes the \emph{context}, which may take different modalities.
    \item $q$ represents the \emph{problem description},  expressed as a natural language query about the context.
    \item $y$ is the \emph{ground truth answer}, serving as the target output for the model.
\end{itemize}
% $\bx$ denotes the \emph{context}, which may take different modalities; $q$ represents the \emph{problem description},  expressed as a natural language query about the context; and $y$ is the \emph{ground truth answer}, serving as the target output for the model.
This formalization provides a unified framework that accommodates a broad spectrum of medical QA tasks, ranging from text-based multiple choice to image-based VQA tasks. 
For example, in a medical image diagnosis setting, the context $\bx$ may correspond to a chest X-ray, the query $q$ would be a natural language prompt such as ``Does this patient have pneumonia?'', and the ground truth $y$ is a categorical label (e.g., 0: normal, 1: pneumonia).


\subsection{Test-Time Inference Strategies}
This section describes two widely adopted, single-step inference strategies---zero-shot direct answering and CoT prompting---and introduces a two-stage framework % (illustrated in Figure \ref{fig:framework}) 
specifically designed for medical image diagnosis that explicitly decomposes the reasoning process into two distinct phases. See Appendix~\ref{sec:prompt} for full prompt templates. A graphical illustration of each prompting strategy is provided in Figure~\ref{fig:concept}.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.6\linewidth]{figure/figure1_sample.png} 
    \caption{Comparison of three inference strategies for medical decision making. (a) \textit{Direct Answering}: The model directly produces an answer from the input without intermediate reasoning. (b) \textit{One-Stage CoT}: The model generates step-by-step reasoning before providing the final answer, prompted with phrases like "Let's think step by step." (c) \textit{Two-Stage Reasoning with Self-Consistency Decoding}: For medical image diagnosis, the vision-language model first generates multiple visual descriptions of the input image, then produces candidate answers for each description. These candidate answers are aggregated via self-consistency decoding to produce the final diagnosis.}
    \label{fig:concept}
\end{figure}


\subsubsection{Direct Answering}
Consider a language model (LM), such as Llama 3.2-Vision instruction-tuned model, that takes a textual prompt $q$ and a context $\bx$ and produces an answer $a$. A straightforward way to use such a model for diagnosis is to directly request a prediction of the target categorical label. For instance, we can set 
$q \leftarrow$ \txt{Given a pediatric chest X-ray image, classify it as 0 (normal) or 1 (pneumonia).} The model directly provides a final answer without requiring any reasoning. We refer to this as the \emph{Direct Answering} method.

\subsubsection{Chain-of-Thought (CoT) Prompting}
Alternatively, to elicit \emph{explicit} reasoning, we employ Chain-of-Thought (CoT) prompting. While we retain the same input prompt $q$ used in the Direct Answering method, we enforce a reasoning process by introducing the trigger phrase \txt{Let's think step by step.} as a prefix to the generation \citep{kojima2022large}.

By leveraging the instruction-tuned nature of these reasoning models, this prefix acts as a directive that strictly orders the output: the model first generates intermediate reasoning steps (e.g., detailed clinical observations) and subsequently produces the final answer $a$. This ensures that the diagnosis is grounded in the generated rationale rather than being a direct, unexplained prediction. We denote this approach as the \emph{one-stage CoT} method.
