
\section{Methods}

\textbf{Vision Language Modeling.} Large language models (LLMs) process sequences of text tokens to generate textual output in an autoregressive manner. This mechanism can be extended to images by adding a vision encoder that extracts visual features, which are then projected into the text embedding space so they can be fed to the language model (LM) as additional input tokens. In practice, this is done by using a pre-trained vision backbone (e.g., ViT \cite{vit} or a CNN-based encoder \cite{convllava}) to extract a sequence of visual feature embeddings, which are then mapped into the language model’s embedding space via a learnable multi-modal adapter \cite{blip2, flamingo}, typically implemented as a multilayer perceptron (MLP).
The resulting image tokens have the same dimensionality as the input text token embeddings, allowing them to be concatenated with the LLM's input sequence \cite{llava}. This unified token stream is then processed autoregressively by the LLM, enabling it to generate text conditioned on both the input image and the input text. This architecture serves as the foundation of an MLLM \cite{blip2, llava, flamingo}. Intuitively, this design allows an image to be treated as a sequence of ``visual words'' that are compatible with text tokens. By projecting visual features into the same embedding space as text, the language model can jointly reason over both modalities using its standard autoregressive decoding mechanism.

Formally, consider a sample $(I, r)$, where $I$ is a chest X-ray image and $r$ is the corresponding radiology findings report. Given an image encoder $E_{img}(\cdot)$, we obtain visual features $I' = E_{img}(I)$, which the multi-modal projector $\lambda(\cdot)$ maps into the LM token embedding space to give $v = \lambda(I')$. The LM receives the visual tokens and the text tokens as a single input sequence, usually with the visual tokens first, followed by the textual tokens. Let the textual input tokens be $u$. Then, at step $t$, the input to the model is $\chi_t = \mathrm{concat}(v,u,r_{<t})$. The LM $\theta$ processes $\chi_t$ to produce the hidden state $h_t = \theta(\chi_t)$, which is passed to the LM head that projects $h_t$ from dimension $d_{m}$ to $|V|$, yielding the logits $z_t = \theta_{head}(h_t)$, where $d_m$ is the LM's internal dimensionality and $V$ is the vocabulary. Finally, we decode the findings report autoregressively from $P(r_t\mid\chi_t) = \mathrm{softmax}(z_t)$: at each decoding step, the model predicts the next text token conditioned on the image, the textual prompt, and all previously generated tokens. The probability of the full generated report therefore factorizes as

\begin{equation}
P_{(\theta,\lambda)}(r \mid v, u) = \prod_{t=1}^{|r|} P_{(\theta,\lambda)}(r_t \mid v, u, r_{<t}).
\end{equation}
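
\noindent As an illustrative note on tensor shapes (the symbols $N_v$ and $N_u$ are introduced here only for exposition), suppose the projector outputs $N_v$ visual tokens and the textual prompt contributes $N_u$ token embeddings. Under this assumption,
\[
v \in \mathbb{R}^{N_v \times d_m}, \qquad
u \in \mathbb{R}^{N_u \times d_m}, \qquad
\chi_t \in \mathbb{R}^{(N_v + N_u + t - 1) \times d_m}, \qquad
z_t \in \mathbb{R}^{|V|},
\]
so each decoding step extends the input sequence by one token embedding, while the visual prefix $v$ and the prompt $u$ stay fixed.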



\subsection{Category-Specific Parametrization}

A free-text radiology findings report can be written as a structured findings report under eight categories (anatomical headers), as described in Appendix Sec. \ref{sec:dataset-appendix}. Foundational RRG models fine-tuned on an SRRG dataset generate an SRR in a single continuous decoding process \cite{srrg}. Motivated by the empirical observation described earlier (Fig. \ref{fig:Attention}), we instead generate the findings for each category in separate, \emph{independent} forward passes to maintain visual grounding on the image tokens $v$ and to reduce the bias arising from excessive attention to previously generated tokens $r_{<t}$. Because the decoding context is reset for each category, the model is encouraged to attend directly to the image rather than rely on textual priors from earlier sections.

Each structured findings report can be represented as $r = (r_{c_1}, r_{c_2}, \ldots, r_{c_n})$, where $1 \leq n \leq 8$ and $c_i$ denotes the $i$-th category. As shown in Fig.~\ref{fig:methods}, to specialize by category without discarding the radiology priors of the base MLLM, we apply low-rank adaptation (LoRA~\cite{lora}) on top of a base MLLM~\cite{llavarad}. This design enables category-level specialization while preserving the general medical knowledge encoded in the base model. Given a foundation MLLM $\theta$ with weights $W$, we train an update $\Delta W =\Delta\theta_{c_i}$ for each category $c_i$; the update is parameterized as the product of two low-rank weight matrices, which significantly reduces the number of trained parameters. During inference, for every image $I$ and every category $c_i$, we generate the category-specific report $\tilde{r}_{c_i}$ using the MLLM $\theta+\Delta\theta_{c_i}$ (henceforth written as $\theta_{c_i}$) and the category prompt $u_{c_i}$. We then concatenate $\tilde{r}_{c_i}$ from all categories to get the predicted structured report $\tilde{r} = (\tilde{r}_{c_1}, \tilde{r}_{c_2}, \ldots, \tilde{r}_{c_n})$.
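
To make the low-rank decomposition concrete, the following sketch uses the standard LoRA parameterization. The symbols $A_{c_i}$, $B_{c_i}$, $\rho$, $d_{in}$, $d_{out}$ and the numerical values are illustrative assumptions, not settings reported in this work. For a single adapted weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$ of the base model,
\[
W_{c_i} = W + \Delta W_{c_i}, \qquad
\Delta W_{c_i} = B_{c_i} A_{c_i}, \qquad
B_{c_i} \in \mathbb{R}^{d_{out} \times \rho}, \quad
A_{c_i} \in \mathbb{R}^{\rho \times d_{in}}, \quad
\rho \ll \min(d_{in}, d_{out}).
\]
Only $A_{c_i}$ and $B_{c_i}$ are trained, i.e., $\rho\,(d_{in} + d_{out})$ parameters instead of $d_{in} d_{out}$. For example, with $d_{in} = d_{out} = 4096$ and $\rho = 16$, this amounts to $131{,}072$ trainable parameters against $16{,}777{,}216$ for the full matrix, or roughly $0.8\%$. The usual LoRA scaling factor is omitted for brevity, and the rank is written as $\rho$ to avoid a clash with the report $r$.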



\subsection{Category-Wise Contrastive Decoding for RRG}

In standard decoding, we sample from the distribution $P(y \mid c,x)$, where $y$ is the output, $x$ is the input, and $c$ is the key context (e.g., an image) required to generate the relevant output. In contrastive decoding, by contrast, we sample from the distribution obtained by contrasting $P(y\mid c,x)$ with $P(y \mid x)$. The distribution $P(y \mid x)$ can be thought of as a representation of the model's prior bias, since it ignores the key context $c$. By contrasting these two distributions, we suppress continuations that are likely under this biased prior alone and amplify those whose probability increases when $c$ is taken into account, encouraging the model to focus on context-relevant information and produce more accurate, grounded outputs.

Inspired by contrastive decoding for natural images \cite{crg}, we propose Category-Wise Contrastive Decoding (CWCD) for Radiology Report Generation. As shown in Fig.~\ref{fig:methods}, given a chest X-ray $I$ and the corresponding \emph{category-specific} bounding boxes $b_{c_i}$, we mask all pixels inside the regions covered by $b_{c_i}$ to obtain $I^{b}_{c_i} = \mathrm{mask}(I,b_{c_i})$. We then run two forward passes through $\theta_{c_i}$, one on the unmasked image (written $I_{c_i}$ below) and one on the masked image $I^{b}_{c_i}$, to obtain $P(r^t_{c_i}\mid I_{c_i}, u_{c_i}, r^{<t}_{c_i})$ and $P(r^t_{c_i}\mid I^{b}_{c_i}, u_{c_i}, r^{<t}_{c_i})$, called the \emph{base} and \emph{masked} probabilities, respectively. We contrast the base and masked log-probabilities using a weighted difference to define a distribution over the next token:
%\ref{eq:cd_propto} and \ref{eq:cd_softmax}:
%\begin{equation}\label{eq:cd_propto}
%r^t_{c_i} \propto 
%\frac{P(r^t_{c_i} \mid I_{c_i}, u_{c_i}, r^{<t}_{c_i})^{1 + \alpha}}
%     {P(r^t_{c_i} \mid I'_{c_i}, u_{c_i}, r^{<t}_{c_i})^\alpha},
%\end{equation}
\begin{equation}\label{eq:cd_softmax}
\begin{aligned}
\text{CD}(r^t_{c_i}) &=
\text{softmax}\Big[
(1+\alpha) \cdot \log P \big(r^t_{c_i} \mid I_{c_i}, u_{c_i}, r^{<t}_{c_i}\big)
- \alpha \cdot \log P\big(r^t_{c_i} \mid I^{b}_{c_i}, u_{c_i}, r^{<t}_{c_i}\big)
\Big] \\
&= \text{softmax}\Big[\log P\big(r^t_{c_i} \mid I_{c_i}, u_{c_i}, r^{<t}_{c_i}\big)
+ \alpha \log \frac{P\big(r^t_{c_i} \mid I_{c_i}, u_{c_i}, r^{<t}_{c_i}\big)}
{P\big(r^t_{c_i} \mid I^{b}_{c_i}, u_{c_i}, r^{<t}_{c_i}\big)}\Big].
\end{aligned}
\end{equation}

\noindent The second form shows that CWCD starts from the base distribution and adds a contrastive term
proportional to the logarithm of the ratio between the base and masked probabilities,
%$\log \frac{P(\cdot \mid I_{c_i}, u_{c_i}, r^{<t}_{c_i})}%{P(\cdot \mid I^{b}_{c_i}, u_{c_i}, r^{<t}_{c_i})}$,
upweighting tokens whose probability increases when the category-specific region is visible
and downweighting those that remain likely even when it is masked. The weighting factor $\alpha$ determines how strongly the contrast affects the selection: increasing $\alpha$ amplifies the emphasis on differences between the base and masked distributions. The next token $r^t_{c_i}$ is chosen greedily based on the $\text{CD}(\cdot)$ scores. This token is then appended to both the base and masked sequences to compute the probabilities for the subsequent timestep. By operating in log-probability space (Eq. \ref{eq:cd_softmax}), the method preserves meaningful contrast even for tokens with low probability. 
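
As a toy illustration of Eq. \ref{eq:cd_softmax} (the numbers are hypothetical and chosen only for arithmetic convenience), note that the softmax of the weighted log-difference is the normalized ratio
\[
\text{CD}(k) =
\frac{P_{\text{base}}(k)^{1+\alpha} \,/\, P_{\text{mask}}(k)^{\alpha}}
     {\sum_{j} P_{\text{base}}(j)^{1+\alpha} \,/\, P_{\text{mask}}(j)^{\alpha}},
\]
where $P_{\text{base}}$ and $P_{\text{mask}}$ abbreviate the base and masked next-token probabilities. Take a three-token vocabulary with $P_{\text{base}} = (0.5,\, 0.3,\, 0.2)$, $P_{\text{mask}} = (0.6,\, 0.1,\, 0.3)$, and $\alpha = 1$. The unnormalized weights are
\[
\Big(\tfrac{0.5^2}{0.6},\ \tfrac{0.3^2}{0.1},\ \tfrac{0.2^2}{0.3}\Big) \approx (0.417,\ 0.900,\ 0.133),
\qquad
\text{CD} \approx (0.287,\ 0.621,\ 0.092).
\]
Greedy decoding on the base distribution would pick the first token, whereas the contrastive distribution favors the second token, whose probability drops the most once the region is masked.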


\subsection{Plausibility-Based Vocabulary Subselection}

While CWCD effectively contrasts the base and masked distributions, applying it indiscriminately at every timestep can undesirably penalize tokens to which both distributions assign high probability. These are often tokens that satisfy basic grammatical or linguistic constraints and can therefore be generated even from a masked chest X-ray input. Such penalization can reduce the final probability of highly plausible tokens, potentially leading to unintended outputs. To address this, we employ Plausibility-Based Vocabulary Subselection, an adaptive plausibility constraint inspired by \citet{apc-og}.

At each decoding step, we truncate the candidate token set based on the unmasked (base) distribution: only tokens whose probability is at least a fraction $\beta$ of the probability of the most likely token at the current step are retained for the softmax after contrasting. This preserves highly probable, linguistically natural tokens while excluding implausible or low-probability ones, yielding a subselected vocabulary at each timestep over which the contrastive softmax is computed:
\begin{equation} \label{eq:vocab}
    V^t_{sub} =\Big\{ r^t \in V: P\big(r^t \mid I, u, r^{<t}\big) \geq \beta \cdot \max_{w \in V} P\big(w \mid I, u, r^{<t}\big)\Big\}.
\end{equation}
Equivalently, in log space, a token is retained when $\log P(r^t \mid I, u, r^{<t}) \geq \log \beta + \max_{w \in V} \log P(w \mid I, u, r^{<t})$.

% \vspace{5pt}
The overall category-wise contrastive objective becomes:
\begin{equation}\label{eq:cd}
\text{CD}(r^t_{c_i}) = 
\text{softmax} \Bigg( 
\log \frac{P(r^t_{c_i} \mid I_{c_i}, u_{c_i}, r^{<t}_{c_i})^{1+\alpha}}
       {P(r^t_{c_i} \mid I^{b}_{c_i}, u_{c_i}, r^{<t}_{c_i})^{\alpha}} 
+ \mathds{I}(r^t_{c_i})
\Bigg),
\end{equation}
% \vspace{5pt}
where the additive mask $\mathds{I}(\cdot)$ removes tokens outside the subselected vocabulary:
\begin{equation}
\mathds{I}(r^t_{c_i}) =
\begin{cases}
0 & \text{if } r^t_{c_i} \in V^t_{\text{sub}}\\
-\infty & \text{otherwise}.
\end{cases}
\end{equation}

We use $\beta=0.50$ (ablation study in Sec. \ref{sec:beta}) and $\alpha = 1$ to balance the base and contrastive terms without overly suppressing plausible tokens, following \citet{crg}.
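
For intuition, consider a worked example with these settings ($\alpha = 1$, $\beta = 0.5$) on a hypothetical three-token vocabulary; the probability values are illustrative and not taken from our experiments. Let the base and masked next-token probabilities be $P_{\text{base}} = (0.5,\, 0.3,\, 0.2)$ and $P_{\text{mask}} = (0.6,\, 0.1,\, 0.3)$. Retaining tokens whose base probability is at least $\beta$ times the maximum base probability gives the threshold $0.5 \times 0.5 = 0.25$, so the first two tokens are kept and the third ($0.2 < 0.25$) is excluded. The contrastive weights over the retained tokens are
\[
\Big(\tfrac{0.5^2}{0.6},\ \tfrac{0.3^2}{0.1}\Big) \approx (0.417,\ 0.900),
\qquad
\text{CD} \approx (0.316,\ 0.684,\ 0),
\]
so greedy selection picks the second token, and the excluded third token receives zero probability regardless of its contrastive score.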

\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:methods}
  {\caption{An overview of the CWCD framework for the ``Cardiovascular'' anatomical category. The base log-probability distribution is contrasted with the masked log-probability distribution using Eq.~\ref{eq:cd}. We then select the highest-probability token from the resulting distribution. This process repeats autoregressively for each token to obtain a category report. Reports across all categories are aggregated to obtain the full structured report.}}
  {\includegraphics[width=0.95\linewidth]{Diagrams/Methods_fig2.pdf}}
  
\end{figure}