\section{Results and Discussion}
We first assess the impact of ALO on patient-level performance across three datasets: CT-RATE (internal validation), RAD-ChestCT (external validation), and AMOS-MM (external validation). The main clinical, NLG, and classification metrics are summarized in Table~\ref{tab:main_results}.

\begin{table}[h]
    \centering
    \caption{Patient-level performance of the baseline models and the ALO-enhanced variant across three evaluation datasets, reported using clinical, natural language generation (NLG), and classification (CL) metrics. Classification metrics are reported with 95\% confidence intervals. (\textbf{bold} = best on dataset; -~= not reported; \textcolor{colHighlight}{\rule{12pt}{8pt}}~highlighted columns show VLM3D-relevant metrics).}
    \label{tab:main_results}
    \resizebox{\textwidth}{!}{%
    \begin{tabular}{
        p{2.7cm}
        p{3.1cm}
        >{\centering\arraybackslash}p{1.2cm}
        >{\centering\arraybackslash}p{1.0cm}
        >{\centering\arraybackslash}p{1.6cm}
        >{\centering\arraybackslash}p{1.9cm}
        >{\centering\arraybackslash}>{\columncolor{colHighlight}}p{0.9cm} % CRG highlighted
        p{0.05cm}
        >{\centering\arraybackslash}>{\columncolor{colHighlight}}p{1.1cm} % BLEU highlighted
        >{\centering\arraybackslash}p{1.1cm}
        p{0.05cm}
        >{\centering\arraybackslash}>{\columncolor{colHighlight}}p{2.9cm} % P highlighted
        >{\centering\arraybackslash}>{\columncolor{colHighlight}}p{2.9cm} % R highlighted
        >{\centering\arraybackslash}>{\columncolor{colHighlight}}p{2.9cm} % F1 highlighted
    }
        \hline
        \multirow{2}{*}{\textbf{Dataset}} & \multirow{2}{*}{\textbf{Method}} 
        & \multicolumn{5}{c}{\textbf{Clinical} $\uparrow$} & 
        & \multicolumn{2}{c}{\textbf{NLG} $\uparrow$} & 
        & \multicolumn{3}{c}{\textbf{CL (macro)} $\uparrow$} \\
        \cline{3-7} \cline{9-10} \cline{12-14}
        & & GREEN & RaTE & RadGraph & 1/RadCLIQ & CRG &  & BLEU & BERT & & P & R & F1 \\
        \hline

        % --- CT-RATE ---
        \multirow{5}{*}{\makecell[l]{\textbf{CT-RATE} \\ (1,564 scans)}} 
        & CT-CHAT & 0.437 & 0.664 & 0.200 & 1.235 & 0.367 & & 0.203 & 0.611 & & 0.354 {\scriptsize [0.307, 0.400]} & 0.158 {\scriptsize [0.150, 0.165]} & 0.169 {\scriptsize [0.161, 0.176]} \\
        & Free-Text & 0.435 & 0.659 & 0.201 & 1.225 & 0.353 & & 0.201 & 0.612 & & \textbf{0.389} {\scriptsize [0.331, 0.425]} & 0.097 {\scriptsize [0.091, 0.104]} & 0.115 {\scriptsize [0.106, 0.124]} \\
        & Structured & \textbf{0.489} & \textbf{0.678} & \textbf{0.232} & \textbf{1.276} & 0.356 & & 0.218 & 0.615 & & 0.356 {\scriptsize [0.332, 0.383]} & 0.118 {\scriptsize [0.109, 0.127]} & 0.168 {\scriptsize [0.156, 0.180]} \\
        & Anatomy Experts & 0.480 & 0.675 & 0.216 & 1.246 & 0.364 & & 0.208 & \textbf{0.617} & & 0.349 {\scriptsize [0.328, 0.371]} & 0.147 {\scriptsize [0.139, 0.156]} & 0.190 {\scriptsize [0.180, 0.199]} \\
        & ALO & 0.341 & 0.662 & 0.197 & 1.171 & \textbf{0.385} & & \textbf{0.219} & 0.604 & & 0.332 {\scriptsize [0.318, 0.346]} & \textbf{0.260} {\scriptsize [0.250, 0.270]} & \textbf{0.285} {\scriptsize [0.274, 0.295]} \\
        \hline

        % --- RAD-ChestCT ---
        \multirow{5}{*}{\makecell[l]{\textbf{RAD-ChestCT} \\ (3,630 scans)}} 
        & CT-CHAT & - & - & - & - & \textbf{0.385} & & - & - & & 0.320 {\scriptsize [0.296, 0.342]} & 0.178 {\scriptsize [0.173, 0.183]} & 0.173 {\scriptsize [0.167, 0.179]} \\
        & Free-Text & - & - & - & - & 0.362 & & - & - & & 0.352 {\scriptsize [0.314, 0.396]} & 0.114 {\scriptsize [0.109, 0.119]} & 0.130 {\scriptsize [0.124, 0.136]} \\
        & Structured & - & - & - & - & 0.348 & & - & - & & 0.355 {\scriptsize [0.329, 0.382]} & 0.081 {\scriptsize [0.075, 0.087]} & 0.122 {\scriptsize [0.114, 0.130]} \\
        & Anatomy Experts & - & - & - & - & 0.363 & & - & - & & \textbf{0.409} {\scriptsize [0.366, 0.451]} & 0.133 {\scriptsize [0.126, 0.140]} & 0.175 {\scriptsize [0.166, 0.182]} \\
        & ALO & - & - & - & - & 0.381 & & - & - & & 0.328 {\scriptsize [0.317, 0.340]} & \textbf{0.227} {\scriptsize [0.218, 0.235]} & \textbf{0.254} {\scriptsize [0.246, 0.262]} \\
        \hline

        % --- AMOS-MM ---
        \multirow{5}{*}{\makecell[l]{\textbf{AMOS-MM} \\ (510 scans)}} 
        & CT-CHAT & 0.197 & \textbf{0.513} & 0.035 & \textbf{0.635} & 0.339 & & 0.025 & \textbf{0.432} & & \textbf{0.182} {\scriptsize [0.092, 0.215]} & 0.142 {\scriptsize [0.115, 0.168]} & 0.086 {\scriptsize [0.070, 0.103]} \\
        & Free-Text & 0.215 & 0.506 & 0.033 & 0.627 & 0.341 & & 0.022 & 0.425 & & 0.179 {\scriptsize [0.072, 0.235]} & 0.048 {\scriptsize [0.035, 0.058]} & 0.044 {\scriptsize [0.032, 0.058]} \\
        & Structured & 0.209 & 0.507 & 0.032 & 0.606 & 0.349 & & 0.019 & 0.399 & & 0.147 {\scriptsize [0.119, 0.176]} & 0.118 {\scriptsize [0.080, 0.134]} & 0.110 {\scriptsize [0.086, 0.134]} \\
        & Anatomy Experts & \textbf{0.230} & 0.510 & 0.036 & 0.621 & 0.353 & & 0.022 & 0.419 & & 0.157 {\scriptsize [0.128, 0.186]} & 0.118 {\scriptsize [0.089, 0.156]} & 0.103 {\scriptsize [0.083, 0.121]} \\
        & ALO & 0.174 & \textbf{0.513} & \textbf{0.039} & 0.623 & \textbf{0.379} & & \textbf{0.027} & 0.420 & & 0.176 {\scriptsize [0.153, 0.200]} & \textbf{0.214} {\scriptsize [0.184, 0.248]} & \textbf{0.166} {\scriptsize [0.146, 0.185]} \\
        \hline

    \end{tabular}%
    }
\end{table}

On the internal CT-RATE split, ALO substantially improves sensitivity to pathological findings. Compared to the Anatomy Experts baseline without oversampling, Recall increases from 0.147 to 0.260 and F1-Score from 0.190 to 0.285, while Precision decreases only slightly (from 0.349 to 0.332). Models trained on patient-level free-text or patient-level structured reports yield higher traditional clinical metrics (e.g., GREEN, RaTE, RadGraph) than the ALO-enhanced model, with the structured model achieving the strongest overall clinical scores. At the same time, the ALO-enhanced model attains the highest CRG score (0.385). CRG is a distribution-aware metric that emphasizes clinically relevant abnormalities and mitigates the tendency of conventional clinical metrics (e.g., GREEN) to favor trivial or normal-dominated predictions \citep{hamamci2025crg}. Classic NLG metrics (BLEU and BERTScore) remain stable across all variants, indicating that ALO primarily affects clinical correctness rather than surface-level fluency or style.
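
For reference, the classification columns of Table~\ref{tab:main_results} can be read under the standard macro-averaging convention. The exact implementation is not specified in this section, so the following is a sketch under that assumption, with $C$ pathologies and per-pathology counts $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ introduced here purely as notation:
\[
    P_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c}, \qquad
    R_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c}, \qquad
    \mathrm{F1}_c = \frac{2 P_c R_c}{P_c + R_c},
\]
\[
    P_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} P_c, \qquad
    R_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} R_c, \qquad
    \mathrm{F1}_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{F1}_c .
\]
Under this convention, macro F1 is the mean of the per-pathology F1 scores and need not equal the harmonic mean of macro Precision and macro Recall. For the ALO row on CT-RATE, for example, $2 \cdot 0.332 \cdot 0.260 / (0.332 + 0.260) \approx 0.292$, whereas the reported macro F1 is 0.285, which is consistent with per-pathology averaging.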

The gains in Recall and F1-Score generalize to external datasets. On RAD-ChestCT, Recall improves from 0.178 (CT-CHAT) and 0.133 (Anatomy Experts) to 0.227 with ALO, with a corresponding F1-Score increase from 0.173 (CT-CHAT) and 0.175 (Anatomy Experts) to 0.254. On AMOS-MM, ALO attains the highest CRG (0.379) among all methods, as well as the best Recall (0.214) and F1-Score (0.166).
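
To put these gains on a common scale, the relative change of ALO over the Anatomy Experts (AE) baseline can be computed directly from the point estimates in Table~\ref{tab:main_results}. This is plain arithmetic on the reported values and ignores the confidence intervals; the symbol $\Delta_{\mathrm{rel}}$ is introduced here only for illustration:
\[
    \Delta_{\mathrm{rel}}(m) = \frac{m_{\mathrm{ALO}} - m_{\mathrm{AE}}}{m_{\mathrm{AE}}} .
\]
For Recall, this gives $(0.260 - 0.147)/0.147 \approx +76.9\%$ on CT-RATE, $(0.227 - 0.133)/0.133 \approx +70.7\%$ on RAD-ChestCT, and $(0.214 - 0.118)/0.118 \approx +81.4\%$ on AMOS-MM. For F1-Score, the corresponding values are $(0.285 - 0.190)/0.190 = +50.0\%$, $(0.254 - 0.175)/0.175 \approx +45.1\%$, and $(0.166 - 0.103)/0.103 \approx +61.2\%$.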

Taken together, these results demonstrate that balancing anatomy-level supervision via ALO systematically improves abnormality detection on both internal and external datasets, while preserving the overall text quality and structure of the generated reports. Additional results, including our VLM3D challenge submission and extended anatomy-level analyses, are provided in Appendices~\ref{app:challenge_results} and \ref{app:anatomy_level_eval}.

\subsection{Ablation Study: Effect of Anatomy-Level Oversampling}
To isolate the contribution of ALO, we compare the Anatomy Experts baseline to its ALO-enhanced variant on the internal CT-RATE validation set and the external RAD-ChestCT dataset (AMOS-MM results can be found in Appendix~\ref{app:ablations}). Figures~\ref{fig:ablation_ct-rate} and~\ref{fig:ablation_rad-chestct} show radar plots of per-pathology Precision, Recall, and F1-Score for both training strategies. Each axis corresponds to a specific pathology and the curves represent the Anatomy Experts baseline and the ALO-trained model.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.99\linewidth]{figures_274/ablation_ct-rate.pdf}
    \caption{Per-pathology ablation on the internal CT-RATE dataset. The radar plots show Precision, Recall, and F1-Score for each pathology, comparing the Anatomy Experts baseline with the ALO-trained model.}
    \label{fig:ablation_ct-rate}
\end{figure}

On the internal CT-RATE split, the ALO curve encloses the Anatomy Experts baseline for Recall and F1-Score across nearly all pathologies, indicating a systematic reduction in false negatives (see Figure~\ref{fig:ablation_ct-rate}). At the same time, Precision drops only slightly.
This effect stems from anatomy-level oversampling, which biases the model toward reporting abnormalities: false positives increase slightly, while false negatives decrease substantially. For time-critical findings such as pulmonary nodules, this trade-off toward higher recall is clinically preferable.
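
The trade-off between false positives and false negatives can be made concrete with a short derivation for a single pathology. This is an illustrative sketch, not an analysis performed in this work: $a > 0$ and $b \geq 0$ are hypothetical counts of false negatives converted into true positives and of newly introduced false positives, respectively, and $\mathrm{TP} > 0$ is assumed. Writing F1 in terms of counts before and after the change gives
\[
    \mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad
    \mathrm{F1}' = \frac{2\,(\mathrm{TP} + a)}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + a + b} .
\]
Comparing the two expressions shows that $\mathrm{F1}' > \mathrm{F1}$ whenever
\[
    \frac{b}{a} < \frac{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}{\mathrm{TP}} = \frac{2 - \mathrm{F1}}{\mathrm{F1}} .
\]
Because the right-hand side exceeds 1 whenever false negatives are present, exchanging one false negative for at most one false positive always raises F1. At a per-pathology F1 of 0.190 (used here only as an illustrative value, borrowed from the macro average of the Anatomy Experts baseline on CT-RATE), the bound is $(2 - 0.190)/0.190 \approx 9.5$, so F1 still improves as long as fewer than about 9.5 false positives are added per recovered false negative.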

\begin{figure}[h]
    \centering
    \includegraphics[width=0.99\linewidth]{figures_274/ablation_rad-chestct.pdf}
    \caption{Per-pathology ablation on the external RAD-ChestCT dataset. The radar plots show Precision, Recall, and F1-Score for each pathology, comparing the Anatomy Experts baseline with the ALO-trained model.}
    \label{fig:ablation_rad-chestct}
\end{figure}

The external RAD-ChestCT ablation shows a similar pattern. ALO improves Recall and F1-Score for several clinically important conditions. Again, per-pathology Precision remains largely stable relative to the Anatomy Experts baseline, although macro Precision decreases from 0.409 to 0.328 (Table~\ref{tab:main_results}). This consistency across datasets supports the conclusion that ALO primarily improves model sensitivity.


\subsection{Qualitative Evaluation}
Figure~\ref{fig:qualitative_eval} shows outputs from the lung and heart expert models. The generated reports successfully identify major pathologies, such as ``nonspecific nodule'' and ``increased heart size'' (True Positives). However, discrepancies remain: for instance, the heart model misses the ``atherosclerotic wall calcifications'' (False Negative). This is consistent with the low overall F1-Scores in Table~\ref{tab:main_results} and highlights that current report generation models still struggle with comprehensive abnormality detection. Furthermore, differences in reported normal findings (Figure~\ref{fig:qualitative_eval}, ``Non-Overlapping Normal Findings'') contribute to lower clinical and NLG scores despite correct identification of primary findings, underscoring the limitations of existing metrics in capturing clinical correctness.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.99\linewidth]{figures_274/qualitative_eval.pdf}
    \caption{Case study for lung and heart. Green marks correct detections (true positive/negative); red marks diagnostic errors (false positive/negative); ``Non-Overlapping Normal Findings'' shows valid healthy findings differing from the reference.}
    \label{fig:qualitative_eval}
\end{figure}


\subsection{Limitations}
While ALO yields substantial gains, we observe three limitations. First, the modular architecture increases inference complexity by requiring separate forward passes for each anatomy and the impressions model, resulting in a $1.6\times$ increase in inference time compared to single-pass baselines. However, this design improves overall clinical performance and enables independent updates for specific anatomy expert models. Second, the pipeline relies on upstream tools for report structuring and labeling. Errors introduced at this stage may propagate to downstream training, and quantifying the impact of this structuring noise remains an open question for future work. Finally, the external AMOS-MM dataset presents a significant domain gap: as an abdomen-focused dataset, its chest reports are substantially shorter and stylistically distinct from those of the thoracic-focused CT-RATE, making it a challenging out-of-distribution test.
