\section{Experimental Setup}
\subsection{Dataset}
\noindent\textbf{Training.}
We train our models on the RadGenome-ChestCT dataset \citep{zhang2024radgenome}, a lightweight structured variant of CT-RATE \citep{hamamci2026generalist}. RadGenome-ChestCT provides anatomy-level sentences for all CT-RATE reports, enabling the construction of balanced anatomy-specific datasets using ALO (Section~\ref{sec:alo_method}). For all experiments, we set the oversampling factor to \(1\). Table~\ref{tab:anatomy_oversampling} summarizes the total number of anatomy-level training samples, the prevalence of abnormal findings, and the resulting dataset sizes after applying ALO. Additional pathology distribution statistics are provided in Appendix~\ref{app:pathology_distribution}.
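
As a worked check on Table~\ref{tab:anatomy_oversampling}, the reported ALO dataset sizes are consistent with adding one extra copy of every abnormal sample per unit of oversampling factor. The symbols below are introduced here for illustration only: \(N\) is the number of anatomy-level samples, \(N_{\text{abn}}\) the number of abnormal samples, and \(k\) the oversampling factor. Under this reading,
\[
N_{\text{ALO}} = N + k\,N_{\text{abn}},
\]
so that with \(k = 1\) the Lung row gives \(23{,}494 + 19{,}079 = 42{,}573\) and the Bone row gives \(23{,}235 + 1{,}530 = 24{,}765\), matching the last column of the table.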

For CT volume preprocessing, we follow CT-CHAT \citep{hamamci2026generalist}. Volumes are resampled to a uniform voxel spacing of \(0.75\,\text{mm} \times 0.75\,\text{mm} \times 1.5\,\text{mm}\) and brought to a fixed shape of \(480 \times 480 \times 240\) by center cropping or padding. Hounsfield unit (HU) values are clipped to \([-1000,\,1000]\) and linearly normalized to the range \([-1,\,1]\).
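
The intensity and shape steps of this preprocessing can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the CT-CHAT implementation: the function names, the axis order of the target shape, and the padding value of \(-1000\) HU (air) are all hypothetical choices, and resampling to the target voxel spacing is omitted.

```python
import numpy as np

TARGET_SHAPE = (480, 480, 240)    # fixed volume shape stated in the text
HU_MIN, HU_MAX = -1000.0, 1000.0  # HU clipping window stated in the text


def center_crop_or_pad(volume, target_shape=TARGET_SHAPE, pad_value=HU_MIN):
    """Center-crop axes that are too long; symmetrically pad axes that are too short.

    pad_value defaults to -1000 HU (air), which is an assumption.
    """
    out = volume
    for axis, target in enumerate(target_shape):
        size = out.shape[axis]
        if size > target:
            start = (size - target) // 2
            out = np.take(out, np.arange(start, start + target), axis=axis)
        elif size < target:
            before = (target - size) // 2
            after = target - size - before
            pad_width = [(0, 0)] * out.ndim
            pad_width[axis] = (before, after)
            out = np.pad(out, pad_width, mode="constant", constant_values=pad_value)
    return out


def normalize_hu(volume):
    """Clip HU values to [-1000, 1000] and rescale linearly to [-1, 1]."""
    clipped = np.clip(volume.astype(np.float32), HU_MIN, HU_MAX)
    # The window is symmetric around 0, so dividing by HU_MAX maps it onto [-1, 1].
    return clipped / HU_MAX
```

Because the clipping window is symmetric, the normalization reduces to a single division; an asymmetric window would need an explicit shift as well.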

\begin{table}[h]
    \centering
    \caption{Anatomy-level sample counts in the RadGenome-ChestCT train split (23,880 total samples), together with the number and percentage of samples with abnormal findings and the resulting ALO dataset size.}
    \label{tab:anatomy_oversampling}
    \begin{tabular}{lcccc}
        \hline
        \textbf{Anatomy} &
        \textbf{Total Samples} &
        \textbf{Abnormal} &
        \textbf{Abnormal \%} &
        \textbf{ALO Dataset} \\
        \hline
        Lung                 & 23{,}494 & 19{,}079 & 81.1 & 42{,}573 \\
        Trachea\&Bronchi     & 21{,}754 &  1{,}731 &  7.9 & 23{,}485 \\
        Mediastinum          & 23{,}438 &  9{,}515 & 40.7 & 32{,}953 \\
        Heart                & 23{,}048 &  6{,}364 & 27.6 & 29{,}412 \\
        Esophagus            & 20{,}553 &  3{,}335 & 16.3 & 23{,}888 \\
        Pleura               & 17{,}983 &  6{,}653 & 37.0 & 24{,}636 \\
        Abdomen              & 23{,}307 &  7{,}498 & 32.2 & 30{,}805 \\
        Bone                 & 23{,}235 &  1{,}530 &  6.6 & 24{,}765 \\
        Others               &  6{,}210 &  1{,}343 & 21.6 &  7{,}553 \\
        \hline
    \end{tabular}
\end{table}

\noindent\textbf{Validation.}
For validation, we use three public datasets covering both internal and external distributions. First, we evaluate on the official CT-RATE validation split \citep{hamamci2026generalist}, which contains \(1{,}564\) studies. We additionally submit predictions to the VLM3D challenge leaderboard \citep{hamamci2026generalist}, which evaluates performance on a hidden in-center validation set comprising \(2{,}000\) patients.

As an external benchmark, we use the RAD-ChestCT dataset \citep{draelos2021machine}, which includes \(3{,}630\) chest CT studies with 16 pathology labels (label mappings follow \citet{hamamci2026generalist} and are reported in Appendix~\ref{app:class_mapping}). Because the second external dataset used in CT-RATE (i.e., UPMC) is not publicly accessible, we replace it with AMOS-MM,\footnote{AMOS-MM Dataset: \url{https://era-ai-biomed.github.io/amos/dataset.html\#overview}} which provides \(510\) CT scans covering the chest \citep{ji2022amos}. To ensure compatibility with CT-RATE, we extract only the chest slices and the corresponding chest portions of the reports. Pathology labels are obtained using the CT-RATE report classifier.\footnote{Report classifier: \url{https://huggingface.co/datasets/ibrahimhamamci/CT-RATE/tree/main/models}} Further details on AMOS-MM preprocessing are given in Appendix~\ref{app:amos_processing}.

\subsection{Baseline Methods}
We compare ALO against four baselines that share the same model architecture, vision encoder, and training setup. (1) CT-CHAT is a publicly available model trained on 2.7 million question-answer pairs, which also include free-text reports from CT-RATE. (2) Free-Text trains the model on free-text, patient-level reports. (3) Structured uses anatomy-level report decomposition but trains a single model on all anatomies without balancing. (4) Anatomy Experts trains a separate expert model for each anatomy on the anatomy-specific findings sections while preserving the original class imbalance.

\subsection{Evaluation}
We evaluate all models using the VLM3D challenge protocol\footnote{VLM3D challenge: \url{https://reportgen.vlm3dchallenge.com}} for classification-based metrics and complement it with RadEval \citep{xu2025radeval} to provide a comprehensive set of clinical and natural language generation (NLG) measures. Performance is assessed at both the patient and anatomy levels. Following the VLM3D protocol, we evaluate both the findings and impressions sections. For CT-RATE and AMOS-MM, we report both RadEval and VLM3D metrics. For RAD-ChestCT, we report only classification-based metrics because textual reports are not available.
In our analysis, we highlight Precision, Recall, and F1-score (Table~\ref{tab:main_results}) as well as pathology-level metrics (Figures~\ref{fig:ablation_ct-rate} and~\ref{fig:ablation_rad-chestct}), as these better reflect sensitivity to abnormal findings in imbalanced report generation settings. The full set of clinical, NLG, and classification metrics is still reported for completeness.
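
To make the highlighted classification metrics concrete, the sketch below computes per-pathology Precision, Recall, and F1-score from binary label matrices, such as those a report classifier would produce for reference and generated reports. It is a minimal illustration rather than the VLM3D evaluation code: the function names, the zero-division convention, and the macro averaging are assumptions.

```python
import numpy as np


def per_label_prf(y_true, y_pred):
    """Per-pathology precision, recall, and F1.

    Both inputs are binary matrices of shape (n_studies, n_pathologies).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)
    # Assumed convention: a metric is 0 when its denominator is 0.
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


def macro_average(values):
    """Unweighted mean over pathologies (an assumed averaging choice)."""
    return float(np.mean(values))


# Toy example with 4 studies and 2 pathologies.
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 0], [1, 1]]
precision, recall, f1 = per_label_prf(y_true, y_pred)
```

Macro averaging weights every pathology equally, so rare abnormalities influence the score as much as common ones, which is the property that matters in imbalanced settings.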

\subsection{Implementation Details}
\noindent\textbf{Findings VLM.}
We finetune Meta-Llama-3.1-8B-Instruct\footnote{VLM findings LLM: \url{https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct}} with CT-CLIP\footnote{CT-CLIP: \url{https://huggingface.co/datasets/ibrahimhamamci/CT-RATE/tree/main/models/}} as a frozen vision encoder. Anatomy expert models are trained independently with LoRA adapters \citep{hu2022lora}. We train for 10 epochs using the Adam optimizer with a cosine learning-rate schedule (lr \(2\times10^{-5}\)) and an effective batch size of 16 on four NVIDIA A100 (80\,GB) GPUs.
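
For readers unfamiliar with the schedule, the following sketch shows one common form of cosine learning-rate decay and how an effective batch size decomposes under data parallelism. It is illustrative only: the absence of warmup, the decay to zero, the total step count, and the per-device batch size and gradient accumulation values are assumptions, not settings reported here.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-5, min_lr=0.0):
    """Cosine-annealed learning rate at a given optimizer step (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps):
    """Number of samples contributing to one optimizer update."""
    return per_device_batch * num_gpus * grad_accum_steps


# One hypothetical way to reach an effective batch size of 16 on four GPUs.
batch = effective_batch_size(per_device_batch=2, num_gpus=4, grad_accum_steps=2)

# Learning rate at the start, midpoint, and end of a hypothetical 1000-step run.
lrs = [cosine_lr(s, total_steps=1000) for s in (0, 500, 1000)]
```

The same function describes the impressions model schedule when called with \texttt{peak\_lr=1e-4}.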

\noindent\textbf{Impressions LLM.}
To generate the patient-level impressions section from the anatomy-level findings, we finetune SmolLM3-3B\footnote{Impressions LLM: \url{https://huggingface.co/HuggingFaceTB/SmolLM3-3B}} using Axolotl \citep{axolotl} for 3 epochs with the Adam optimizer and a cosine learning-rate schedule (lr \(1\times10^{-4}\)), using an effective batch size of 1,024 on four NVIDIA A40 (48\,GB) GPUs.