\section{Methods}

ALO aims to increase the sensitivity of radiology report generation models to pathological findings by reorganizing the task at the anatomy level and correcting the strong imbalance between normal and abnormal findings. ALO consists of three steps: (1) structuring and labeling reports at the anatomy level, (2) balancing the distribution of normal and abnormal findings, and (3) training anatomy-specific expert models.

\subsection{Anatomy-Level Structuring and Labeling}
To obtain consistent supervision, we convert each free-text, patient-level report into a set of anatomy-level findings
(Figure~\ref{fig:method_overview}) using the anatomy annotations provided by RadGenome-ChestCT \citep{zhang2024radgenome}. Each extracted anatomy-level finding \(f_i\) is then labeled as normal or abnormal using a report classifier \(C(\cdot)\), which predicts
\[
y_i = C(f_i), \qquad y_i \in \{\text{normal}, \text{abnormal}\}.
\]

The classifier predicts 18 pathology labels for each findings section. A section is labeled abnormal if at least one pathology is detected and normal otherwise.
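
The labeling rule can be sketched in a few lines of Python. This is a minimal illustration and not the classifier itself: the function name \texttt{label\_finding} and the representation of the classifier output as a sequence of 18 booleans are assumptions made for this example.

```python
from typing import Sequence

NUM_PATHOLOGIES = 18  # number of pathology labels stated in the text


def label_finding(pathology_flags: Sequence[bool]) -> str:
    """Map per-pathology predictions to a normal/abnormal label.

    `pathology_flags` is a hypothetical representation of the classifier
    output: one boolean per pathology, True if that pathology is detected.
    """
    if len(pathology_flags) != NUM_PATHOLOGIES:
        raise ValueError(
            f"expected {NUM_PATHOLOGIES} flags, got {len(pathology_flags)}"
        )
    # Abnormal if at least one pathology is detected, normal otherwise.
    return "abnormal" if any(pathology_flags) else "normal"
```

Under this rule, a single detected pathology is enough to mark the whole anatomy-level finding as abnormal.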

The complete structured report is written as
\[
R = \{ (a_i, y_i, f_i) \}_{i=1}^{N},
\]
where \(a_i\) is the anatomical region, \(f_i\) is the extracted findings text for that region, \(y_i\) is the predicted label, and \(N\) is the total number of anatomical regions considered.
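
As a rough sketch of how the structured report \(R\) could be represented in code, the following Python snippet stores each triple \((a_i, y_i, f_i)\) as a small record. The names \texttt{AnatomyFinding} and \texttt{structure\_report} are hypothetical, and the classifier \(C(\cdot)\) is passed in as a plain callable because its actual interface is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AnatomyFinding:
    anatomy: str  # a_i: anatomical region
    label: str    # y_i: "normal" or "abnormal"
    text: str     # f_i: findings text extracted for that region


def structure_report(
    findings_by_anatomy: Dict[str, str],
    classify: Callable[[str], str],
) -> List[AnatomyFinding]:
    """Build R = {(a_i, y_i, f_i)} from region-wise findings text.

    `classify` stands in for the report classifier C; it is an assumed
    interface that maps findings text to "normal" or "abnormal".
    """
    return [
        AnatomyFinding(anatomy=a, label=classify(f), text=f)
        for a, f in findings_by_anatomy.items()
    ]
```

The length of the returned list corresponds to \(N\), the number of anatomical regions considered.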

We assess the robustness of the report structuring step via a content preservation analysis on the CT-RATE dataset. Results are reported in Appendix Table~\ref{tab:structuring_quality}.

\subsection{Balancing Normal and Abnormal Findings}
\label{sec:alo_method}
Normal findings are substantially more frequent than abnormal ones. This imbalance encourages models to repeat normal statements while underreporting pathological findings. To reduce this effect, we oversample abnormal findings during training.

For each anatomical region \(a\), we construct an anatomy-specific training set \(D_a\) consisting of all anatomy-level findings \(f_i\) assigned to \(a\) together with their corresponding normal/abnormal labels \(y_i\). The subset \(A_a \subseteq D_a\) contains all samples with \(y_i = \text{abnormal}\).
Given an integer oversampling factor \(x \ge 1\), we construct an ALO-balanced dataset
\[
D_a^{\text{ALO}} = D_a \uplus \underbrace{A_a \uplus \dots \uplus A_a}_{x \text{ times}},
\]
where \(\uplus\) denotes multiset union, so that repeated samples are retained. In other words, we keep all original samples and add the abnormal subset \(A_a\) exactly \(x\) additional times, making the ratio between normal and abnormal samples more balanced.
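
The oversampling step can be illustrated with a short Python sketch. It assumes that each sample is a (findings text, label) pair held in a list; the function names \texttt{alo\_balance} and \texttt{abnormal\_fraction} are introduced only for this example.

```python
from typing import List, Tuple

Sample = Tuple[str, str]  # (findings text f_i, label y_i)


def alo_balance(d_a: List[Sample], x: int) -> List[Sample]:
    """Return D_a plus x extra copies of its abnormal subset A_a."""
    if x < 1:
        raise ValueError("oversampling factor x must be an integer >= 1")
    a_a = [s for s in d_a if s[1] == "abnormal"]
    # Lists keep duplicates, so the repeated abnormal samples are retained.
    return d_a + a_a * x


def abnormal_fraction(samples: List[Sample]) -> float:
    """Share of samples labeled abnormal (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return sum(1 for _, y in samples if y == "abnormal") / len(samples)
```

With this construction, the share of abnormal samples rises from \(|A_a|/|D_a|\) to \((1+x)\,|A_a| / (|D_a| + x\,|A_a|)\). For a toy set with 9 normal samples and 1 abnormal sample and \(x = 3\), the share increases from \(1/10\) to \(4/13 \approx 0.31\).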

\begin{figure}[t]
    \centering
    \includegraphics[width=1.0\textwidth]{figures_274/inference.pdf}
    \caption{Inference Pipeline: (1) Expert models generate anatomy-level findings. (2) Impression generation model summarizes anatomy-level findings into impressions.}
    \label{fig:inference_pipeline}
\end{figure}


\subsection{Anatomy-Specific Generation}
We reformulate report generation as a modular, anatomy-conditioned prediction task. A 3D visual encoder $E(\cdot)$ processes a CT volume $V$ to produce a visual embedding $v$.

On top of this embedding, we train a set of anatomy-specific expert generators $\{G_i\}_{i=1}^{N}$, each trained only on the ALO-balanced dataset $D_{a_i}^{\text{ALO}}$ of its corresponding anatomy $a_i$ (see Figure~\ref{fig:method_overview}). During training, each expert learns to generate the anatomy-level findings section
\[
\hat{f}_i = G_i(v),
\]
allowing the model to specialize in the visual cues, anatomy-specific report phrasing, and structure characteristic of that anatomy.

At inference time, the volume is encoded once to obtain the visual embedding $v$, which is shared by all anatomy experts. The shared embedding provides each expert with global visual context, which helps capture systemic disease patterns and cross-anatomical correlations. Each expert then operates independently to produce its anatomy-level findings,
\[
\hat{f}_1, \hat{f}_2, \dots, \hat{f}_N.
\]

These findings are concatenated in a fixed anatomical order to form the findings section. A separate language model $I(\cdot)$ acts as an impression agent, converting the findings into a concise clinical summary (see Figure~\ref{fig:inference_pipeline}):
\[
\hat{I} = I\!\left(\mathrm{Concat}(\hat{f}_1,\dots,\hat{f}_N)\right).
\]

The final report mirrors the conventional radiological structure: a detailed anatomy-level findings section followed by a patient-level impression section.
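
The inference procedure can be summarized with the following Python sketch. It is schematic and makes several assumptions: the encoder \(E\), the experts \(G_i\), and the impression model \(I\) are represented as plain callables, findings are joined with a single space, and the function name \texttt{generate\_report} is hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple, TypeVar

Volume = TypeVar("Volume")
Embedding = TypeVar("Embedding")


def generate_report(
    volume: Volume,
    encoder: Callable[[Volume], Embedding],
    experts: Dict[str, Callable[[Embedding], str]],
    anatomy_order: Sequence[str],
    impression_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (findings section, impression) for one CT volume."""
    v = encoder(volume)  # encode once; v is shared by all experts
    # Each expert independently produces its anatomy-level findings.
    per_anatomy = [experts[a](v) for a in anatomy_order]
    # Concatenate in the fixed anatomical order (separator is an assumption).
    findings = " ".join(per_anatomy)
    impression = impression_model(findings)
    return findings, impression
```

Because the experts depend only on the shared embedding and not on each other, they could in principle be run in parallel; the sequential loop here is chosen for clarity.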
