\newpage
\section{Extended Motivation} \label{sec:motivation-extended}

\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:Attention2}
  {\caption{We replicated the experiment presented in Sec.~\ref{sec:intro} on CheXagent-2 to demonstrate that the problem of attention decay over image tokens during token generation also affects other MLLMs.}}
  {\includegraphics[width=1\linewidth]{Diagrams/attention_chexagent.png}}
\end{figure}

\section{Related Work}

\textbf{Structured Findings Generation.} The findings section of a radiology report comprises visual observations from a given chest X-ray. These reports are usually written as free text, but a growing body of work establishes the utility of structured reports. \citet{structured_1} showed that clinicians rated structured reports as significantly more complete and more effective. \citet{structured_recall} showed that structured reports allowed better recall of diagnoses and critical findings, and both referring physicians and radiologists preferred structured reports over free-text reports \cite{structured_preference}. Recently, \citet{srrg} introduced desiderata for structured reporting in which the radiology report is divided into predefined sections and the findings section is further divided under the eight anatomical headers mentioned previously. They converted the free-text reports of MIMIC-CXR and CheXpert Plus into structured reports and introduced two new datasets, SRRG-Findings and SRRG-Impression. \citet{csrrg} further added clinical context, such as multiple views, clinical indication, imaging technique, and prior studies, yielding a new dataset called contextualized SRRG (C-SRRG).

Beyond clinical utility, in automated report generation systems, structured reports help mitigate distributional shift between textual reports originating from different datasets, where the same clinical finding may be described in markedly different styles due to linguistic, institutional, or regional differences among radiologists. By standardizing both the reporting categories and the linguistic style, structured reports reduce this variability and provide more consistent supervision for model training. Additionally, the natural division of the findings section into well-defined anatomical categories enables category-wise parametrization and modular report generation. We believe this structure promotes stronger visual grounding by preventing over-reliance on language priors and by reducing the number of tokens generated within each continuous forward pass.

\noindent
\textbf{Contrastive Decoding.} Contrastive decoding (CD) is a training-free, inference-time strategy for reducing hallucinations in text-generative models \cite{contrastive-open-ended, vcd, contrastive-reasoning}. The main idea of CD is to overcome statistical biases (such as object co-occurrences) inherent in the training data and, in the case of MLLMs, to prevent over-reliance on textual priors learned during the pre-training of the LLM. Contrasting with the distribution produced after masking the key information required to generate the correct output penalizes the tokens that are generated when that information is missing, effectively exposing the prior bias of the model. Various approaches to CD in MLLMs have been explored: \citet{vcd} contrast output distributions derived from original and distorted visual inputs, \citet{itav} contrast inter-layer representations, and \citet{crg} contrast model outputs produced with and without visual prompts. While CD has worked well for mitigating hallucinations in natural image captioning tasks, its use in medical tasks has been very limited. \citet{contrastive-medical} developed Alternative CD for a medical information extraction task, in which they alternately contrasted output distributions from sub-task modules. \citet{ccd} introduced a dual-stage CD mechanism for RRG. Both \citet{contrastive-medical} and \citet{ccd} rely on text-based contrasts, whereas, to the best of our knowledge, we are the first to introduce an image-based CD approach for RRG, i.e., the contrasted distribution is generated by masking the X-ray instead of masking the text.
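
For concreteness, image-based CD of the kind proposed by \citet{vcd} can be sketched as follows. The notation is illustrative and is an assumption of this sketch rather than our exact formulation: $v$ is the original image, $v'$ its distorted or masked counterpart, $x$ the text prompt, $y_{<t}$ the previously generated tokens, and $\gamma \ge 0$ a contrast strength.
\[
  p_{\mathrm{cd}}(y_t \mid v, v', x, y_{<t})
  = \mathrm{softmax}\bigl[(1+\gamma)\,\mathrm{logit}_\theta(y_t \mid v, x, y_{<t})
  - \gamma\,\mathrm{logit}_\theta(y_t \mid v', x, y_{<t})\bigr]
\]
Tokens that remain likely even when the relevant visual evidence is removed receive a large second term and are therefore down-weighted, whereas $\gamma = 0$ recovers standard decoding.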

\section{Datasets} \label{sec:dataset-appendix}

\noindent
\textbf{MIMIC-CXR} dataset is a large, publicly available collection of de-identified chest radiographs and accompanying free-text radiology reports. The dataset was sourced from the Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA, and includes imaging studies collected as part of routine clinical care between 2011 and 2016. It contains 377,110 chest X-ray images corresponding to 227,827 imaging studies from 65,379 patients. Most studies include both frontal (anteroposterior or posteroanterior) and lateral views, and the original images are stored in DICOM format. We use the JPEG-format images provided in MIMIC-CXR-JPG \cite{mimic-jpg}.

All images in the dataset were acquired as part of routine clinical care using standard radiography equipment in a hospital environment and were subsequently de-identified in accordance with HIPAA regulations. The dataset was not curated for specific diseases; instead, it preserves the natural distribution of thoracic conditions and imaging characteristics encountered in real-world clinical practice. As a result, the images exhibit substantial clinical variability, including differences in patient positioning (e.g., anteroposterior and posteroanterior views), acquisition settings, image quality, and the presence of medical devices. The accompanying radiology reports were produced by board-certified radiologists at the time of image acquisition and are temporally aligned with the imaging studies. \\

\noindent
\textbf{SRRG-Findings} dataset is derived from the findings section of reports in MIMIC-CXR and CheXpert Plus \cite{chexpert-plus}, which are converted into a standardized structured format using GPT-4 \cite{gpt4} following a strict set of desiderata. In SRRG, each free-text findings section is reorganized under a fixed set of eight anatomical headers: Lungs and Airways; Pleura; Cardiovascular; Hila and Mediastinum; Tubes, Catheters and Support Devices; Musculoskeletal and Chest Wall; Abdomen; and Other. Within each category, observations are expressed as bullet-point statements. \\

\noindent
\textbf{IU-Xray} dataset from Indiana University is a publicly available chest X-ray dataset comprising 8,121 chest X-ray images and 3,996 associated radiology reports, collected from the picture archiving systems of the Indiana Network for Patient Care. The images and reports were de-identified automatically and then manually verified in accordance with HIPAA guidelines. For our evaluation, we randomly select 20\% of the data as the test set, following previous work \cite{r2gen}. 





\section{Grounding DINO Fine-Tuning} \label{sec:vp-extractor}

Grounding DINO is an open-set object detector that takes an image and a text prompt as input and outputs bounding boxes corresponding to the specified text. While it demonstrates strong performance on natural images, we fine-tune Grounding DINO on LATTE-CXR to extract category-specific bounding boxes aligned with our anatomical headers.

As described in Sec.~\ref{sec:datasets}, LATTE-CXR contains 13,751 sentence–bounding box pairs. Each sentence–box pair is classified into one of eight anatomical categories using DeepSeek. The training set consists of 8,850 bounding box–anatomical region pairs, which are used to fine-tune Grounding DINO.

During fine-tuning, we optimize a contrastive loss \cite{contrastive} between object features and text tokens for classification, along with L1 and GIoU \cite{glou} losses for bounding box regression.
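
Under assumed notation, this objective can be sketched as follows. Here $b$ and $\hat{b}$ are the ground-truth and predicted boxes, $C$ is the smallest axis-aligned box enclosing both, $|\cdot|$ denotes area, $\mathcal{L}_{\mathrm{cls}}$ is the contrastive classification loss, and $\lambda_{L1}$ and $\lambda_{\mathrm{GIoU}}$ are hypothetical weighting coefficients whose values are not specified here.
\[
  \mathcal{L} = \mathcal{L}_{\mathrm{cls}}
  + \lambda_{L1}\,\| b - \hat{b} \|_1
  + \lambda_{\mathrm{GIoU}}\,\bigl(1 - \mathrm{GIoU}(b, \hat{b})\bigr)
\]
\[
  \mathrm{GIoU}(b, \hat{b})
  = \frac{|b \cap \hat{b}|}{|b \cup \hat{b}|}
  - \frac{|C \setminus (b \cup \hat{b})|}{|C|}
\]
The second term of GIoU keeps the regression loss informative even when the two boxes do not overlap, a case in which plain IoU is zero.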

During inference for a given anatomical category, we input the chest X-ray and the corresponding anatomical header, and the model returns one or more relevant bounding boxes.


\section{Using Visual Prompts}

In this section, we study the role of visual prompts (VPs) in our framework. While VPs have been used in prior work to enhance medical visual question answering (MedVQA) \cite{vp-vqa-acl} and zero-shot classification \cite{vp-zero-shot-classification-midl}, to the best of our knowledge, no prior study has leveraged VPs in a training-free manner specifically to improve radiology report generation.

Since CWCD employs masked VPs during evaluation, we ensure a fair comparison by providing the baseline LLaVA-Rad model with VPs in two ways: (i) $\alpha$-blended visual prompts overlaid on the input X-ray, following prior work \cite{vp-vqa-acl, vp-zero-shot-classification-midl}, and (ii) masked VPs for contrastive decoding combined with vocabulary subselection (VS), effectively extending the approach of \citet{crg} with VS.
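
As a sketch of option (i), assuming the standard alpha-compositing form: let $I(p)$ be the intensity of the X-ray at pixel $p$, $R$ the set of pixels covered by the visual prompt, $c$ the prompt colour, and $\alpha \in [0,1]$ the blending weight. Whether $R$ is the box interior or only its outline, and the values of $c$ and $\alpha$, are assumptions left open in this sketch.
\[
  I_{\alpha}(p) =
  \left\{
  \begin{array}{ll}
    (1-\alpha)\, I(p) + \alpha\, c, & p \in R, \\
    I(p), & \mbox{otherwise.}
  \end{array}
  \right.
\]
Unlike masking, blending with $\alpha < 1$ keeps the underlying anatomy visible inside $R$.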

As shown in Tab.~\ref{tab:vp-fair}, both approaches (rows 2 and 3) perform worse than category-wise report generation (CW, row 4), where no VPs are provided. We hypothesize that the $\alpha$-blended VP approach is less effective for radiology report generation than for MedVQA or zero-shot classification because the task is open-ended and requires more visual prompts per X-ray (4--5 vs.\ 1--2 in MedVQA).

Overall, these results suggest that addressing the fundamental issue of attention decay in MLLMs through category-wise report generation provides the largest performance gains, while the inclusion of masked VPs offers modest additional improvements.

\begin{table}[htbp]
\floatconts
  {tab:vp-fair}%
  {\caption{Ablation study of CWCD on the dataset defined in Sec.~\ref{sec:datasets} using ground-truth VPs from LATTE-CXR. VS stands for Vocabulary Subselection, VP for Visual Prompt, and CW for Category-Wise report generation.}}%
  {%
  \begin{tabular}{l|c|c|c|c|c|c}
    \hline
    \bfseries Model & \bfseries VP & \bfseries BL-4 & \bfseries BS &  \bfseries R-L & \bfseries F1Rad & \bfseries F1 \\
    \hline
    LLaVA-Rad (Baseline) & No & 4.74 & 48.31  & 30.45 & 24.06 & 62.12 \\
    \hline
    LLaVA-Rad ($\alpha$-blended VP) & Yes & 3.34 & 44.33  & 27.30 & 19.22 & 49.15 \\
    LLaVA-Rad (CD+VS) & Masked & 5.13 & \underline{49.75} & 31.62  & 24.70 & 59.98 \\
    CW              & No & \underline{6.46} & 49.58 &  \underline{32.91} & \underline{27.31} & \textbf{62.57} \\
    \hline
    \rowcolor{green!20}
    CWCD  & Masked & \textbf{6.60} & \textbf{50.22} &  \textbf{33.27} & \textbf{27.96} & \underline{62.51} \\
    \hline
  \end{tabular}
  }
\end{table}

\section{The Masking Mechanism}

While generating a structured radiology report for a particular category, all pixels on and within the corresponding bounding boxes are blacked out (RGB value of 0,0,0), effectively removing the underlying visual information from the input image, as shown in Fig.~\ref{fig:CWCD-1}. As a result, the MLLM generates tokens conditioned only on the remaining regions of the X-ray and the previously generated text tokens.
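
In symbols, under illustrative notation: let $I(p)$ be the value of the input X-ray at pixel $p$ and let $B_{c,1}, \dots, B_{c,K_c}$ be the bounding boxes for category $c$, each taken as a closed region so that its border pixels are included. The masked image is then
\[
  \tilde{I}_c(p) =
  \left\{
  \begin{array}{ll}
    0, & p \in \bigcup_{k=1}^{K_c} B_{c,k}, \\
    I(p), & \mbox{otherwise,}
  \end{array}
  \right.
\]
where $0$ stands for black in all three channels.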

This masking mechanism is critical for contrastive decoding, as it enables a controlled comparison between tokens produced with and without access to the relevant visual region. By fully removing category-specific visual evidence, differences in the resulting outputs reflect the model’s reliance on that region for generating category-specific descriptions. Partial masking or soft attenuation may allow residual visual cues to persist, weakening the contrastive signal. Therefore, complete masking provides a clear intervention for isolating the contribution of the masked region to the generated text.

\subsection{Hyperparameter Tuning} \label{sec:beta}

We analyze the effect of the vocabulary threshold hyperparameter $\beta$, which controls the minimum log-probability cutoff relative to the highest-probability token at each decoding step (Eq.~\ref{eq:vocab}). Tables~\ref{tab:cat-abl21} and~\ref{tab:cat-abl22} show the impact of varying $\beta$ on NLG and clinical efficacy metrics, with the baseline without Vocabulary Subselection highlighted in \textcolor{red}{red} and the chosen $\beta$ in \textcolor{green}{green}.
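
A cutoff of this kind is commonly written as the adaptive plausibility constraint used in prior CD work such as \citet{vcd}. The sketch below restates that standard form under the assumption that Eq.~\ref{eq:vocab} follows it; the notation is illustrative, with $\mathcal{V}$ the full vocabulary and $p_\theta(\cdot \mid v, x, y_{<t})$ the next-token distribution given the unmasked image $v$, the prompt $x$, and the prefix $y_{<t}$.
\[
  \mathcal{V}_{\beta}(y_{<t}) =
  \Bigl\{\, y \in \mathcal{V} \;:\;
  p_\theta(y \mid v, x, y_{<t}) \;\ge\;
  \beta \max_{w \in \mathcal{V}} p_\theta(w \mid v, x, y_{<t}) \Bigr\}
\]
For $\beta > 0$ this is equivalent to requiring $\log p_\theta(y \mid v, x, y_{<t}) \ge \log \beta + \max_{w} \log p_\theta(w \mid v, x, y_{<t})$. With $\beta = 0$ no token is excluded, which corresponds to decoding without Vocabulary Subselection, while $\beta \to 1$ keeps only tokens whose probability matches the maximum.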

Very low values of $\beta$ (0.00--0.01), corresponding to minimal filtering, lead to lower overall performance on both NLG and clinical metrics, indicating that including low-probability tokens increases the risk of generating irrelevant or spurious content. Moderate values of $\beta$ (0.10--0.50) show steady improvements, with $\beta=0.50$ achieving the best balance and the strongest overall performance. Higher thresholds (0.75--0.90) remain competitive and give marginally higher Pr and F1 (Table~\ref{tab:cat-abl22}), but they lower most NLG and F1Rad scores and may slightly restrict the generation of relevant content.

Overall, these trends demonstrate that vocabulary subselection is a critical component of CWCD, and that an appropriately chosen $\beta$ effectively balances linguistic quality with clinical correctness.

\begin{table}[htbp]
\floatconts
  {tab:cat-abl21}%
  {\caption{Effect of the hyperparameter $\beta$ (Eq. \ref{eq:vocab}) on CWCD's overall performance on \textbf{NLG} metrics. $\beta$ used in CWCD is highlighted in \textcolor{green}{green} and the baseline without Vocabulary Subselection is highlighted in \textcolor{red}{red}.}}%
  {%
  \begin{tabular}{l|cccc|c|ccc}
    \hline
    \bfseries $\beta$ & \bfseries BL-1 & \bfseries BL-2 & \bfseries BL-3 & \bfseries BL-4 & \bfseries BS & \bfseries R-1 & \bfseries R-2 & \bfseries R-L \\
    \hline
    \rowcolor{red!20}
    0.00 & 27.15 & 16.11 & 10.80 & 6.23 & 49.77 & 34.15 & 19.27 & 32.00 \\
    0.01 & 27.21 & 16.15 & 10.84 & 6.25 & 49.79 & 34.19 & 19.29 & 32.05 \\
    0.10 & 27.63 & 16.56 & 11.14 & 6.37 & \underline{50.22} & 34.82 & 19.86 & 32.69 \\
    0.25 & \underline{27.66} & \underline{16.57} & 11.20 & 6.39 & 50.18 & 35.00 & 20.02 & 32.95 \\
    \rowcolor{green!20}
    0.50 & \textbf{27.76} & \textbf{16.77} & \textbf{11.53} & \textbf{6.60} & \textbf{50.22} & \textbf{35.26} & \underline{20.25} & \textbf{33.27} \\
    0.75 & 27.40 & 16.43 & \underline{11.39} & \underline{6.52} & 49.82 & \underline{35.05} & \textbf{20.26} & \underline{33.10} \\
    0.90 & 27.22 & 16.34 & 11.34 & 6.44 & 49.64 & 34.89 & 20.22 & 32.96 \\
    \hline
  \end{tabular}
  }
\end{table}

\begin{table}[htbp]
\floatconts
  {tab:cat-abl22}%
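  {\caption{Effect of the hyperparameter $\beta$ (Eq.~\ref{eq:vocab}) on CWCD's overall performance on \textbf{clinical efficacy} metrics. $\beta$ used in CWCD is highlighted in \textcolor{green}{green} and the baseline without Vocabulary Subselection is highlighted in \textcolor{red}{red}.}}%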
  {%
  \begin{tabular}{l|ccc|ccc}
    \hline
    \bfseries $\beta$ & \bfseries F1Rad-S & \bfseries F1Rad & \bfseries F1Rad-C & \bfseries Pr & \bfseries Rc & \bfseries F1 \\
    \hline
    \rowcolor{red!20}
    0.00 & 31.21 & 26.53 & 23.22 & 65.53 & 59.76 & 60.40 \\
    0.01 & 31.30 & 26.62 & 23.31 & 65.61 & 59.78 & 60.46 \\
    0.10 & 31.96 & 27.25 & 23.90 & 66.79 & 60.23 & 61.33 \\
    0.25 & 32.30 & 27.45 & 24.06 & 67.68 & 60.68 & 61.97 \\
    \rowcolor{green!20}
    0.50 & \textbf{32.96} & \textbf{27.96} & \textbf{24.60} & {68.59} & \underline{61.08} & {62.51} \\
    0.75 & \underline{32.34} & \underline{27.49} & \underline{24.23} & \underline{68.75} & \textbf{61.21} & \textbf{62.68} \\
    0.90 & 32.23 & 27.41 & 24.08 & \textbf{68.76} & 61.07 & \underline{62.58} \\
    \hline
  \end{tabular}
  }
\end{table}