\section{Conclusion}
\label{sec:conclusion}

We present a comprehensive study of TTS for clinical decision making, spanning textual QA benchmarks and medical image diagnosis, with both empirical and analytical components. TTS consistently improves performance over single-pass baselines, by up to 30.4 percentage points, and we provide scaling laws characterizing when such improvements emerge. Our analysis reveals that TTS is most effective when the underlying model possesses non-trivial baseline competence, since scaling then amplifies informative reasoning rather than biased outputs. Experiments confirm that TTS generalizes beyond text, yielding strong improvements on vision-centric tasks and extending to domain-specific medical VLMs such as LLaVA-Med (Table~\ref{tab:llava-med}). Crucially, this parallel reasoning approach improves zero-shot performance without costly supervision, which sidesteps the scarcity of high-quality medical annotations needed to train verifiers or reward models.

We note that our study focuses on reward-free TTS. Verifier-based methods such as best-of-N remain unexplored here because reliable medical reward models do not yet exist, as training them demands massive labeled data. Existing efforts are either text-only~\cite{wang2024process} or restricted to radiology reports~\cite{thomas2025process}. Future directions include adaptive TTS strategies that dynamically allocate compute, domain-specific reward models for multimodal medical reasoning, and studies of how TTS interacts with domain-specialized models, with the aim of assessing clinical workflow integration and trustworthy decision-making.
