\section{Introduction} \label{sec:intro}

Over the past two decades, the rapid advancement of Artificial Intelligence (AI) has significantly improved automated interpretation of medical images \cite{medical-survey, diagnostic-imaging}, particularly chest X-rays, which remain one of the most frequently performed diagnostic procedures worldwide \cite{common}. Chest X-rays are highly valued due to their low cost, minimal radiation exposure, and ability to provide substantial clinical information. Despite these advantages, generating radiology reports remains a cognitively demanding and time-consuming task \cite{report-difficult}. Compounding this challenge, the growing demand for interpreting chest X-rays has outpaced the supply of radiologists \cite{shortage}, leaving many radiologists overworked and vulnerable to fatigue \cite{fatigue}.


\begin{figure}[htbp]
\floatconts
  {fig:CWCD-1}
  {\caption{Category-Wise Contrastive Decoding (CWCD) generates a category-wise structured report under eight anatomical headers by contrasting a normal X-ray with a masked X-ray (three categories shown here for brevity).}}
  {\includegraphics[width=1\linewidth]{Diagrams/Figure_1.pdf}}
\end{figure}

Automated Radiology Report Generation (RRG), the task of producing free-text descriptions of visual observations from a radiology image, such as a chest X-ray, has therefore emerged as an essential research direction \cite{rrg-importance1, rrg-importance2}. However, automated RRG remains fundamentally challenging: unlike natural images, chest X-rays exhibit low contrast and may contain subtle, highly localized pathologies. The requirement to generate long, unconstrained textual reports imposes additional demands on model fidelity. Unlike visual question answering, which produces relatively short, focused outputs, comprehensive radiology findings reports may exceed 200 tokens, and the model must reason jointly over multiple, often overlapping, anatomical regions.

Early encoder-decoder approaches \cite{encoder-decoder-1, encoder-decoder-2} established a strong foundation and were able to generate linguistically cohesive reports; however, they often lagged in clinical efficacy \cite{encoder-decoder-bad}. The rise of Large Language Models (LLMs) \cite{gpt2, llama} and subsequently multi-modal LLMs (MLLMs) \cite{llava, flamingo} enabled the development of the first generation of radiology foundation models \cite{radfm, chexagent, maira1, radialog, r2genGPT}. These models leveraged the superior language modeling and linguistic reasoning capabilities of LLMs and substantially scaled parameter counts to surpass the then state-of-the-art encoder-decoder models. They delivered marked improvements in clinical efficacy metrics and demonstrated stronger generalization performance on out-of-distribution datasets \cite{radialog}.

The second generation of radiology foundation models further advanced performance: \citet{llavarad} employed GPT-4 \cite{gpt4} to refine training data by removing temporal comparisons, references to prior exams and unnecessary language variations, while \citet{maira2} expanded the textual context to include indications, technique and comparison, and the visual context by including lateral and prior frontal views. Despite these advances, these foundation models remain constrained by a core limitation of MLLMs: the reduction in attention values over image tokens as more tokens are generated \cite{hallucination1, hallucination2}.

\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:Attention}
  {\caption{Layer-Averaged Max Attention (LAMA) score (Eq.~\ref{eq:lama}) calculated from 100 randomly sampled images from the MIMIC-CXR dataset using LLaVA-Rad over text tokens (left) and image tokens (right). During the report generation process, we observe a pronounced decline in attention to image tokens accompanied by a steady increase in reliance on linguistic priors.}}
  {\includegraphics[width=0.80\linewidth]{Diagrams/attention.png}}
\end{figure}

\textbf{Motivation.} We observe that, as report generation progresses, the model increasingly attends to prior linguistic context rather than to the image information. The maximum weight in a multi-head attention layer \cite{attention} can be interpreted as a signal of the model's strong confidence in the corresponding input token \cite{mulithead-confidence, opera}. Based on this insight, we define \emph{Layer-Averaged Max Attention (LAMA)}, which
can be computed over any subset of target tokens~$S$ (e.g., image tokens or generated
text tokens). Let $A^{(l,h)}_t \in \mathbb{R}^{N}$ denote the attention weights over the $N$ attended tokens for
generated token $t$ in layer $l$ and head $h$, and let $L$ be the number of layers. Then the LAMA score at step~$t$ is:
\begin{equation} \label{eq:lama}
\text{LAMA}_t(S)
=
\frac{1}{L} \sum_{l=1}^{L}
\max_{h} \left( \sum_{i \in S} A^{(l,h)}_t[i] \right).
\end{equation}
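As a hypothetical worked example (the numbers below are invented for illustration and are not measurements), consider $L = 2$ layers with two heads each, and write $m^{(l,h)}_t(S) = \sum_{i \in S} A^{(l,h)}_t[i]$ for the attention mass that head $h$ in layer $l$ places on $S$; the symbol $m$ is introduced only for this example. If $m^{(1,1)}_t(S) = 0.6$, $m^{(1,2)}_t(S) = 0.3$, $m^{(2,1)}_t(S) = 0.2$, and $m^{(2,2)}_t(S) = 0.4$, then
\[
\text{LAMA}_t(S)
= \tfrac{1}{2} \big( \max\{0.6,\, 0.3\} + \max\{0.2,\, 0.4\} \big)
= \tfrac{1}{2} (0.6 + 0.4)
= 0.5 .
\]
Assuming each attention row is softmax-normalized, and writing $\bar{S}$ for the complement of $S$ among the $N$ attended tokens, the same heads give $\text{LAMA}_t(\bar{S}) = \tfrac{1}{2}(0.7 + 0.8) = 0.75$. Because the maximum over heads is taken separately for each token set, the two scores need not sum to one.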
  
We compute $\text{LAMA}_t(S_{\text{vis}})$, where $S_{\text{vis}}$ denotes the set of all image tokens, for 100 X-rays randomly sampled from the test set of the MIMIC-CXR \cite{mimic} dataset. We observe a clear downward trend in $\text{LAMA}_t(S_{\text{vis}})$ over the generation steps (Fig.~\ref{fig:Attention}), suggesting a decay in attention to the image tokens during the generation process, accompanied by an increase in attention to the language priors. We hypothesize that this decay leads the model to reproduce spurious co-occurrences of pathologies that stem from inherent biases in the training datasets. A typical example of such a spurious co-occurrence arises with cardiomegaly and pulmonary edema. These two findings frequently appear together because both are associated with congestive heart failure \cite{cooccurrence}. As a result, when the model increasingly relies on textual priors, the presence of cardiomegaly in the generated text alone serves as a language cue that strongly biases subsequent tokens toward the associated pathology (pulmonary edema in this case), even if the visual evidence is absent. Similarly, pleural effusion (fluid accumulation) can mechanically lead to some degree of rounded atelectasis (lung collapse) due to compression \cite{cooccurrence2}. This statistical co-occurrence can also lead the model to generate spurious findings simply because they commonly appear together in the training distribution, rather than because they are grounded in the underlying image evidence.

% Our key contributions are:

Given these observations, we introduce \textbf{Category-Wise Contrastive Decoding} (CWCD), a novel and modular method designed to enhance \emph{structured findings generation} in radiology foundation models. CWCD aims to mitigate two problems, the generation of spurious co-occurrences and the reduced attention to visual tokens as output length increases, in two ways. (i) Category-Specific Parametrization: we generate a findings report \emph{category-wise} under eight anatomical headers, as defined by \citet{srrg}: Lungs and Airways; Pleura; Cardiovascular; Hila and Mediastinum; Tubes, Catheters, and Support Devices; Musculoskeletal and Chest Wall; Abdominal; and Other. Henceforth, we refer to these anatomical headers as the categories of a structured radiology report. (ii) Masked Contrastive Decoding: an inference-time strategy in which, instead of standard greedy decoding, we sample from a contrasted distribution obtained by masking the X-ray using category-specific visual prompts. Introducing a contrastive objective at inference time mitigates hallucinations arising from the language-prior bias learned during training.
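
As a rough illustration of what a contrasted distribution can look like, the following uses a form that is common in the contrastive decoding literature. It is an assumption made here for exposition and is not taken from the definition of the method, whose exact rule may differ. The symbols $z_t(\cdot)$, $v$, $\tilde{v}_c$, and $\alpha$ are introduced only for this sketch. Let $z_t(v)$ denote the next-token logits at step $t$ given the original X-ray $v$, and let $z_t(\tilde{v}_c)$ denote the logits given the X-ray masked for category $c$. With a contrast strength $\alpha \geq 0$,
\[
\tilde{p}_t
= \mathrm{softmax}\big( (1 + \alpha)\, z_t(v) - \alpha\, z_t(\tilde{v}_c) \big).
\]
Setting $\alpha = 0$ recovers ordinary decoding from the original image. Under the further assumption that the mask hides the region relevant to category $c$, larger values of $\alpha$ down-weight tokens that remain likely even when that region is hidden, i.e., tokens driven mainly by language priors.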
