
\section{Experiments and Results}
\subsection{Datasets}\label{sec:datasets}

To train the category-wise adapters, we use X-rays from \textbf{MIMIC-CXR} \cite{mimic, mimic2} and the corresponding structured findings reports from \textbf{SRRG-Findings} \cite{srrg}. To support category-wise parametrization, we parse each structured report and extract the bullet-point observations under every anatomical header, thereby constructing eight separate \textbf{category-specific datasets}. Each dataset contains all observations associated with its anatomical region and is used to train the corresponding category-wise adapter. To generate masks for CWCD, we use bounding-box annotations derived from the REFLACX dataset \cite{reflacx} and from LATTE-CXR \cite{latte}, a dataset built on top of it.

\textbf{REFLACX} contains 3,032 readings corresponding to 2,616 unique chest radiographs. It provides radiologist eye-tracking data and manually drawn ellipses that indicate abnormal findings, along with synchronized report transcriptions. \textbf{LATTE-CXR} repurposes the REFLACX annotations to generate bounding-box region annotations aligned with the sentences describing the abnormalities. For gaze-based pairs, radiologist fixations during report dictation are aggregated into Gaussian heatmaps, processed to retain salient regions, and enclosed in axis-aligned rectangles to form bounding boxes aligned with each sentence. Expert-drawn ellipses from REFLACX are also converted into bounding boxes, providing explicit abnormality localization. These boxes represent regions attended to by radiologists rather than exact lesion boundaries. In total, LATTE-CXR includes 13,751 gaze-based region--sentence pairs constructed from 2,742 MIMIC-CXR images. We follow the official MIMIC-CXR split and combine the test and validation sets to obtain a final test set of 912 X-rays. Category-specific bounding boxes are obtained by classifying each sentence--box pair into one of eight anatomical categories using DeepSeek \cite{deepseek}.
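
The gaze-to-box construction described above can be written out as a short sketch. This is one plausible formalization under stated assumptions, not the procedure documented by LATTE-CXR: the fixation positions $f_i$, the fixation weights $w_i$, the bandwidth $\sigma$, and the saliency threshold $\tau \in (0,1]$ are hypothetical symbols introduced here only for illustration. Given the fixations recorded while a sentence is dictated, a heatmap $H$ over pixel locations $p = (p_x, p_y)$ and its salient region $S$ could be defined as
\[
H(p) = \sum_{i} w_i \exp\!\left(-\frac{\| p - f_i \|^2}{2\sigma^2}\right),
\qquad
S = \Bigl\{\, p : H(p) \ge \tau \max_{q} H(q) \,\Bigr\},
\]
and the axis-aligned bounding box enclosing $S$ as
\[
B = \Bigl[\min_{p \in S} p_x,\ \max_{p \in S} p_x\Bigr] \times \Bigl[\min_{p \in S} p_y,\ \max_{p \in S} p_y\Bigr].
\]
Under this reading, $B$ covers the region the radiologist attended to, which is why such boxes need not coincide with exact lesion boundaries.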

Overall, we use frontal X-rays from MIMIC-CXR and structured findings reports from SRRG-Findings for training, and category-specific bounding boxes from LATTE-CXR during inference. Further details about the datasets can be found in Appendix~\ref{sec:dataset-appendix}.

\subsection{Implementation Details}

We use LLaVA-Rad \cite{llavarad} as our baseline MLLM. LLaVA-Rad uses Vicuna-7b-v1.5 \cite{vicuna} as the base language model and BioMedCLIP \cite{biomedclip}, an image encoder trained on large-scale multimodal biomedical data. For each of the eight categories, we train a rank-1 LoRA adapter with $\sim$500k trainable parameters. Across all categories, the total number of trained parameters is equivalent to that of a single rank-8 adapter. We train each adapter for one epoch on the corresponding category-specific dataset, using a single 80GB A100 GPU. Training takes between 4 and 16 hours per adapter, depending on the number of training samples in the category. We use a batch size of 48, a learning rate of $10^{-4}$, and the AdamW \cite{adamw} optimizer.
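
The rank-8 equivalence follows from how the LoRA parameter count scales with rank. As a sketch, assume that every adapter targets the same set $\mathcal{M}$ of weight matrices (this set is an assumption made here for illustration), where matrix $m \in \mathcal{M}$ has shape $d^{(m)}_{\mathrm{out}} \times d^{(m)}_{\mathrm{in}}$. A rank-$r$ LoRA update adds two factors of shapes $d^{(m)}_{\mathrm{out}} \times r$ and $r \times d^{(m)}_{\mathrm{in}}$ to each matrix, so the number of trainable parameters is
\[
P(r) = r \sum_{m \in \mathcal{M}} \bigl( d^{(m)}_{\mathrm{in}} + d^{(m)}_{\mathrm{out}} \bigr),
\qquad\text{and hence}\qquad
8\,P(1) = P(8).
\]
The count is linear in $r$, so eight rank-1 adapters match one rank-8 adapter in trainable parameters, even though each of them is trained separately on its own category-specific dataset.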

\subsection{Evaluation Protocol}\label{sec:eval}

\textbf{Baselines.} We evaluate against a diverse set of baseline radiology foundation models. All baseline models are pre-trained on the MIMIC-CXR dataset to generate free-text findings reports. \citet{srrg} fine-tuned CheXpert-Plus \cite{chexpert-plus-srrg}, CheXagent-2 \cite{chexagent,chexagent-srrg}, and MAIRA-2 \cite{maira2,maira2-srrg} to generate SRR, and \citet{csrrg} fine-tuned Lingshu \cite{lingshu,lingshu-srrg} and MedGemma \cite{medgemma,medgemma-srrg} for the same task. We trained LLaVA-Rad to generate SRR ourselves. CheXpert-Plus and CheXagent-2 were fully fine-tuned, whereas rank-8 LoRA adapters were trained for MAIRA-2 and LLaVA-Rad, and rank-32 LoRA adapters for Lingshu and MedGemma.

\textbf{Metrics.} We evaluate the generated radiology reports using a combination of natural language generation (NLG) and clinical efficacy (CE) metrics, each capturing a distinct aspect of report quality. For NLG, BLEU-1 to BLEU-4 \cite{bleu} measure $n$-gram overlap with the reference reports, where lower-order BLEU (e.g., BLEU-1) emphasizes lexical precision and higher-order BLEU (e.g., BLEU-4) captures short-phrase consistency. ROUGE-1, ROUGE-2, and ROUGE-L \cite{rouge} focus on recall, measuring how much of the reference content is covered, with ROUGE-L additionally reflecting structural similarity. BERTScore (BS) \cite{bertscore} evaluates semantic similarity using contextual embeddings, capturing meaning even when phrasing differs.

For clinical validity, F1-RadGraph \cite{radgraph1, radgraph2} evaluates the accuracy of extracted entities (findings, anatomy) and their relations, with the simple, partial, and complete variants applying increasingly strict matching criteria. We also measure the weighted-average precision, recall, and F1 score over the 55 SRR-BERT labels \cite{srrg}, which enable a broader evaluation than the 14 CheXbert \cite{f1chexbert} disease labels.
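
For reference, the weighted averages can be written out explicitly. This is a sketch that assumes ``weighted'' denotes support-weighted averaging over labels, as is common in multi-label evaluation; the symbols $n_c$, $N$, $P_c$, and $R_c$ are introduced here only for illustration. Let $n_c$ be the number of reference-positive instances of label $c$, $N = \sum_c n_c$, and $P_c$, $R_c$ the per-label precision and recall. Then
\[
\mathrm{Pr} = \sum_{c} \frac{n_c}{N}\, P_c,
\qquad
\mathrm{Rc} = \sum_{c} \frac{n_c}{N}\, R_c,
\qquad
\mathrm{F1} = \sum_{c} \frac{n_c}{N}\, \frac{2 P_c R_c}{P_c + R_c},
\]
where a label with $P_c + R_c = 0$ contributes an F1 of zero. Under this definition, the reported F1 is an average of per-label F1 scores rather than the harmonic mean of the averaged precision and recall, so it need not lie between the two. This would be consistent with the baseline rows of Table~\ref{tab:cat-results2} in which F1 falls slightly below both Pr and Rc.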

\subsection{Results}

We evaluate the Category-Wise Contrastive Decoding (CWCD) framework on the Structured Radiology Report Generation (SRRG) task, using the MIMIC-CXR-derived test set defined in Sec.~\ref{sec:datasets}, against multiple state-of-the-art radiology foundation models. We conduct the SRRG evaluation in the same way as \citet{srrg}, except that we do not penalize the baseline models for omitting a category or generating an extra one; this yields higher baseline scores overall. CWCD improves over all baseline models on every natural language generation and clinical efficacy metric except SRR-BERT recall. In Table~\ref{tab:cat-results1}, CWCD achieves the highest score on all NLG metrics, indicating closer lexical and semantic agreement with the reference reports than the baselines.

Table~\ref{tab:cat-results2} shows that CWCD also improves clinical validity, with F1-RadGraph scores surpassing all other models. The SRR-BERT metrics further show that CWCD attains the highest precision (68.59) and F1 score (62.51) while maintaining competitive recall (61.08), second only to LLaVA-Rad (63.38). The higher precision suggests that CWCD produces fewer spurious or irrelevant findings, reducing the generation of pathology co-occurrences that are biased by language priors in the training data. The competitive recall shows that relevant findings are still captured, and the improved F1 suggests a better overall balance between accuracy and coverage. Taken together, the higher F1-RadGraph scores, improved precision, and robust F1 indicate that CWCD enhances the overall clinical efficacy of generated reports while mitigating spurious correlations.

\begin{table}[htbp]
\floatconts
  {tab:cat-results1}%
  {\caption{Evaluation of CWCD versus Radiology Foundation Models on SRRG task on \textbf{NLG} Metrics defined in Sec. \ref{sec:eval}. Best scores are in \textbf{bold} and second best are \underline{underlined}.}}%
  {%
  \begin{tabular}{l|cccc|c|ccc}
    \hline
    \bfseries Model & \bfseries BL-1 & \bfseries BL-2 & \bfseries BL-3 & \bfseries BL-4 & \bfseries BS & \bfseries R-1 & \bfseries R-2 & \bfseries R-L \\
    \hline
    CheXpert-Plus      & 24.25 & 13.46 & 8.41 & 3.83 & 47.21 & 31.72 & 15.83 & 29.45 \\
    MedGemma           & 23.60 & 13.74 & \underline{9.14} & 4.59 & 47.67 & 32.80 & 16.91 & 30.13 \\
    Lingshu            & \underline{24.76} & 12.84 & 7.22 & 2.22 & 47.15 & 29.73 & 14.62 & 27.74 \\
    CheXagent-2        & 23.35 & 13.71 & 8.80 & 4.59 & 48.03 & 32.79 & 16.84 & 30.28 \\
    MAIRA-2            & 24.31 & 13.87 & 8.42 & 3.79 & \underline{48.57} & \underline{33.07} & \underline{17.47} & \underline{31.18} \\
    LLaVA-Rad          & 24.22 & \underline{14.45} & 9.00 & \underline{4.74} & 48.31 & 32.79 & 17.06 & 30.45 \\

    \hline

    %Category-Based (CB) & \underline{27.12} & \underline{16.26} & \underline{11.33} & %\underline{6.46} & \underline{49.58} & \underline{34.83} & \underline{20.27} & %\underline{32.91} \\
    \rowcolor{green!20}
    CWCD              & \textbf{27.76} & \textbf{16.77} & \textbf{11.53} & \textbf{6.60} & \textbf{50.22} & \textbf{35.26} & \textbf{20.25} & \textbf{33.27} \\
    \hline
  \end{tabular}
  }
\end{table}

\vspace{-0.5cm}
\begin{table}[htbp]
\floatconts
  {tab:cat-results2}%
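  {\caption{Evaluation of CWCD versus Radiology Foundation Models on SRRG task on \textbf{Clinical Efficacy} Metrics defined in Sec.~\ref{sec:eval}. F1Rad-S, F1Rad, and F1Rad-C denote the simple, partial, and complete F1-RadGraph scores; Pr, Rc, and F1 denote the weighted-average precision, recall, and F1 over SRR-BERT labels. Best scores are in \textbf{bold} and second best are \underline{underlined}.}}%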
  {%
  \begin{tabular}{l|ccc|ccc}
    \hline
    \bfseries Model & \bfseries F1Rad-S & \bfseries F1Rad & \bfseries F1Rad-C & \bfseries Pr & \bfseries Rc & \bfseries F1 \\
    \hline
    CheXpert-Plus & 28.71 & 22.89 & 19.80 & 62.44 & 59.47 & 58.72 \\
    MedGemma & 30.11 & 24.49 & 21.19 & 63.03 & 60.64 & 59.62 \\
    Lingshu & 27.86 & 23.82 & 20.84 & 56.02 & 53.60 & 52.90 \\
    CheXagent-2 & 30.27 & 24.29 & 21.11 & 64.20 & 60.74 & 60.67 \\
    MAIRA-2 & \underline{30.54} & \underline{25.26} & \underline{22.08} & 65.36 & 60.92 & 61.03 \\
    LLaVA-Rad & 30.30 & 24.06 & 20.92 & \underline{65.48} & \textbf{63.38} & \underline{62.12} \\

    \hline

    %Category-Based (CB) & \underline{32.15} & \underline{27.31} & \underline{24.07} & %\textbf{68.86} & 60.96 & \textbf{62.57} \\
    \rowcolor{green!20}
    CWCD & \textbf{32.96} & \textbf{27.96} & \textbf{24.60} & \textbf{68.59} & \underline{61.08} & \textbf{62.51} \\
    \hline
  \end{tabular}
  }
\end{table}

\subsection{Ablation Study}

In this section, we conduct an ablation study to understand the contribution of each component of our approach, using the SRRG-Findings task and the dataset described in Sec.~\ref{sec:datasets}. Table~\ref{tab:cat-abl1} summarizes the results for six model variants, each adding or removing a key mechanism of the complete CWCD framework. Applying contrastive decoding (CD) and vocabulary subselection (VS) to SRR generation (2nd row) yields modest gains on most metrics but lowers F1-SRR-BERT from 62.12 to 59.98, indicating limited clinical reliability. Introducing Category-Wise parametrization (CW, 3rd row) yields substantial improvements over the baseline on both NLG and CE metrics, demonstrating the effectiveness of reducing the number of generated tokens within a single set of forward passes. Removing VS from CWCD (4th row) causes a clear drop relative to the complete framework, most visibly in F1-SRR-BERT (60.40 vs. 62.51), highlighting the importance of filtering out low-probability tokens during CD. Masking all visual prompts (VP) in CWCD (5th row) also degrades performance, falling below CW decoding on BL-4, R-1, R-L, and F1-SRR-BERT. Overall, the complete framework, combining CW parametrization, VS, and category-specific VPs, achieves the strongest performance on every metric except F1-SRR-BERT, where CW alone is marginally higher (62.57 vs. 62.51).

\begin{table}[htbp]
\floatconts
  {tab:cat-abl1}%
  {\caption{Ablation study of CWCD on SRRG-Findings task on dataset defined in Sec.~\ref{sec:datasets}. VS stands for Vocabulary Subselection. VP stands for Visual Prompt. CW stands for Category-Wise report generation. Overall CWCD framework metrics are highlighted in \textcolor{green}{green}.}}%
  {%
  \begin{tabular}{l|c|c|c|c|c|c}
    \hline
    \bfseries Model & \bfseries BL-4 & \bfseries BS & \bfseries R-1 & \bfseries R-L & \bfseries F1Rad & \bfseries F1-SRR \\
    \hline
    LLaVA-Rad (Baseline) & 4.74 & 48.31 & 32.79 & 30.45 & 24.06 & 62.12 \\
    \hline
    LLaVA-Rad w/ CD+VS & 5.13 & 49.75 & 33.86 & 31.62 & 24.70 & 59.98 \\
    CW               & \underline{6.46} & 49.58 & \underline{34.83} & \underline{32.91} & 27.31 & \textbf{62.57} \\
    CWCD w/o VS                      & 6.23 & \underline{49.77} & 34.15 & 32.00 & 26.53 & 60.40 \\
    CWCD w/ all VP          & 6.09 & 49.75 & 34.57 & 32.55 & \underline{27.40} & 62.22 \\
    \rowcolor{green!20}
    CWCD w/ Cat-Spec. VP & \textbf{6.60} & \textbf{50.22} & \textbf{35.26} & \textbf{33.27} & \textbf{27.96} & \underline{62.51} \\
    \hline
  \end{tabular}
  }
\end{table}

\subsection{Out-of-Distribution Performance}

We perform out-of-distribution (OOD) evaluation on the test split of IU-Xray \cite{iu-xray}. When evaluating on the MIMIC-CXR dataset, we used ground-truth visual prompt annotations from LATTE-CXR. Since no such annotations exist for IU-Xray, we follow \citet{vp-vqa-acl, crg} and use the Grounding DINO \cite{grounding-dino} model to extract visual prompts for each of the eight SRR categories. Further details about fine-tuning Grounding DINO for this purpose can be found in Appendix~\ref{sec:vp-extractor}.

Tables~\ref{tab:ood-nlg} and~\ref{tab:ood-ce} show that CWCD generalizes well out of distribution, outperforming the foundation models on most NLG and clinical efficacy metrics. MedGemma also exhibits strong OOD performance, leading on BERTScore and SRR-BERT recall; this may be partially attributable to its substantially larger fine-tuning capacity, as it employs rank-32 LoRA adapters, whereas CWCD is trained with parameters equivalent to a rank-8 adapter ($8 \times$ rank-1). Despite this disparity in adaptation capacity, CWCD achieves the best performance on 11 out of 14 metrics, highlighting the robustness of our method under distribution shift.

\begin{table}[htbp]
\floatconts
  {tab:ood-nlg}%
  {\caption{Evaluation of CWCD on the out-of-distribution IU-Xray test set on NLG Metrics.}}%
  {%
  \begin{tabular}{l|cccc|c|ccc}
    \hline
    \bfseries Model & \bfseries BL-1 & \bfseries BL-2 & \bfseries BL-3 & \bfseries BL-4 & \bfseries BS & \bfseries R-1 & \bfseries R-2 & \bfseries R-L \\
    \hline
    CheXpert-Plus      & 27.03 & 15.46 & 7.70 & 1.77 & 45.06 & 36.95 & 17.93 & 34.78 \\
    MedGemma           & 27.27 & 16.42 & 7.98 & 1.55 & \textbf{48.18} & \underline{40.52} & 19.29 & \underline{35.93} \\
    Lingshu            & 27.15 & 15.20 & 7.08 & \textbf{3.02} & 44.72 & 35.62 & 16.96 & 33.14 \\
    CheXagent-2        & 26.30 & 14.67 & 8.01 & 1.70 & 45.24 & 37.24 & 18.26 & 34.44 \\
    MAIRA-2            & 26.68 & 16.01 & \underline{9.18} & 1.41 & \underline{48.17} & 38.23 & 19.43 & 35.67 \\
    LLaVA-Rad          & \underline{27.63} & \underline{16.48} & 8.44 & 1.68 & 46.06 & 39.08 & \underline{21.00} & 35.79 \\

    \hline
    \rowcolor{green!20}
    CWCD              & \textbf{28.47} & \textbf{17.49} & \textbf{9.83} & \underline{2.00} & 47.81 & \textbf{40.63} & \textbf{22.76} & \textbf{37.53} \\
    \hline
  \end{tabular}
  }
\end{table}

%\vspace{-0.5cm}
\begin{table}[htbp]
\floatconts
  {tab:ood-ce}%
  {\caption{Evaluation of CWCD on the out-of-distribution IU-Xray test set on Clinical Efficacy Metrics.}}%
  {%
  \begin{tabular}{l|ccc|ccc}
    \hline
    \bfseries Model & \bfseries F1Rad-S & \bfseries F1Rad & \bfseries F1Rad-C & \bfseries Pr & \bfseries Rc & \bfseries F1 \\
    \hline
    CheXpert-Plus & 34.76 & 28.45 & 23.05 & 75.24 & 76.56 & 73.70 \\
    MedGemma & \underline{42.31} & \underline{33.79} & \underline{28.90} & 78.67 & \textbf{86.26} & 78.29 \\
    Lingshu & 35.88 & 30.48 & 25.13 & 65.86 & 71.80 & 67.05 \\
    CheXagent-2 & 34.22 & 28.59 & 22.35 & 81.67 & 81.26 & 78.71 \\
    MAIRA-2 & 35.43 & 30.20 & 25.50 & \underline{83.30} & 82.87 & \underline{80.44} \\
    LLaVA-Rad & 41.07 & 33.36 & 27.65 & 81.90 & \underline{83.76} & 79.84 \\

    \hline
    \rowcolor{green!20}
    CWCD & \textbf{42.76} & \textbf{35.73} & \textbf{29.03} & \textbf{89.15} & 82.63 & \textbf{83.79} \\
    \hline
  \end{tabular}
  }
\end{table}