\section{Introduction}
Radiology reports play a central role in clinical communication, summarizing imaging findings and guiding diagnostic and therapeutic decisions. Automating report writing through radiology report generation has emerged as a key challenge in multimodal learning, aiming to bridge visual understanding and clinical language. Recent vision-language models (VLMs) have made significant progress on this task \citep{buess2025large}, especially when trained on large datasets pairing computed tomography (CT) volumes with radiologist-written reports \citep{hamamci2024ct2rep, blankemeier2026merlin}. Moreover, recent evidence indicates that AI-generated draft reports can reduce reporting time by about 25\% while maintaining diagnostic accuracy \citep{acosta2024impactaiassistanceradiology}, which could help counteract rising workload pressures in clinical practice.

\begin{figure}[h]
    \centering
    \includegraphics[width=1.0\textwidth]{figures_274/dataset_preparation.pdf}
    \caption{Overview of Anatomy-Level Oversampling (ALO): (1) Convert free-text, patient-level reports into structured anatomy-level findings. (2) Label each anatomical region's findings as normal or abnormal. (3) Construct balanced datasets through targeted oversampling. (4) Train anatomy expert models on the balanced datasets.}
    \label{fig:method_overview}
\end{figure}

Despite recent progress, most existing approaches formulate report generation as free-text prediction \citep{pellegrini2025radialog, hyland2023maira}, where models directly produce narrative reports from images. While intuitive, this setup inherits the variability of clinical writing: syntax, style, and level of detail differ widely across radiologists and institutions, making both learning and evaluation inconsistent. Additionally, radiology datasets are dominated by normal findings \citep{zhang2024radgenome}, creating severe class imbalance that biases models toward underreporting abnormalities, which are the findings most critical for clinical decision-making.

In response, much of the field has focused on architectural or training innovations \citep{hamamci2025better, hein2025chexalign}, including large multimodal transformers and increasingly large-scale pretraining \citep{liu2025t3d, jiang2025hulu}. Nevertheless, supervision quality and label distribution remain persistent challenges that are not resolved by architectural complexity alone. This motivates a complementary perspective: improving the dataset itself through structure and balance can provide substantial gains in diagnostic relevance and reporting consistency, even without modifying model architectures.

We therefore introduce Anatomy-Level Oversampling (ALO) (Figure~\ref{fig:method_overview}), a simple yet effective data-centric strategy for structured and balanced report generation. We first organize each free-text report into sections describing individual anatomical regions. To enable balanced sampling, we then label each section as containing normal or abnormal findings. ALO uses these labels to apply targeted oversampling, reducing the dominance of normal findings. This setup provides standardized and balanced supervision across anatomical regions and enables finer-grained evaluation at the anatomy level rather than at the patient level used in most existing works. Because ALO operates entirely at the data level, it is architecture-agnostic and can be easily integrated into existing VLM training pipelines. In addition, the anatomy-level formulation makes training modular, allowing individual anatomy models to be retrained or updated without affecting performance on other anatomies.
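
As a purely illustrative sketch of the balancing step (the notation and the resampling rule below are assumptions made for exposition, not a specification of ALO), let $N_a^{\mathrm{maj}}$ and $N_a^{\mathrm{min}}$ denote the number of training sections in the majority and minority class for anatomical region $a$, where the majority class is typically the normal one. Sampling minority sections with replacement until both classes are equally represented corresponds to an average replication factor
\[
r_a = \frac{N_a^{\mathrm{maj}}}{N_a^{\mathrm{min}}}, \qquad N_a^{\mathrm{min}} > 0,
\]
so a hypothetical region with 900 normal and 100 abnormal sections would have $r_a = 9$ and a balanced training set of 1{,}800 sections.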

We evaluate ALO on three public CT datasets and show that this data-centric strategy substantially improves model sensitivity to abnormal findings and overall reporting performance. Based on these results, our main contributions are:


\begin{itemize}
    \item We introduce ALO, a simple and model-agnostic strategy that balances normal and abnormal findings within radiology reports, reducing the bias toward normal findings and increasing sensitivity to pathologies.
    \item We present a modular anatomy-level modeling framework in which each anatomical region is trained independently, enabling targeted improvements to individual anatomies without degrading performance on others.
    \item We perform a comprehensive and fine-grained evaluation of report generation models using anatomy-level assessment and a broad suite of clinical, classification, and natural language generation (NLG) metrics, providing substantially more detailed insights than the patient-level evaluations used in most existing works.
\end{itemize}
