
\documentclass{midl} % Include author names
% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{mwe} % to get dummy images
%\usepackage{ctex} % add Chinese-language support
\usepackage{graphicx} % graphics package for including images
\usepackage{enumerate}
\usepackage{multirow}
\usepackage{chngcntr}
\counterwithout{table}{section}
\usepackage{booktabs}
\usepackage{float}  
\jmlrvolume{-- Under Review}
\jmlryear{2025}
\jmlrworkshop{Full Paper -- MIDL 2025 submission}
\editors{Under Review for MIDL 2025}

\title[DiffRGenNet]{DiffRGenNet: Difference-aware Medical Report Generation}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Minghao Bian\nametag{$^{1,2}$}} \Email{bianminghao@mail.ustc.edu.cn}\\
\Name{Kun Zhang\nametag{$^{1,2}$}}\Email{kkzhang@ustc.edu.cn}\\
\Name{Dexin Zhao\nametag{$^{1,2}$}} \Email{dexinzhao@mail.ustc.edu.cn}\\
\Name{S Kevin Zhou\midljointauthortext{Zhou is the corresponding author.} \nametag{$^{1,2,3,4}$}} \Email{skevinzhou@ustc.edu.cn}\\
\addr $^{1}$ School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui, 230026, China \\
\addr $^{2}$ Center for Medical Imaging, Robotics, Analytic Computing \& Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, Suzhou, Jiangsu, 215123, China \\
\addr $^{3}$ State Key Laboratory of Precision and Intelligent Chemistry, USTC, Hefei, Anhui 230026, China \\
\addr $^{4}$ Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
}

\begin{document}

\maketitle

\begin{abstract}
% Medical report generation is a critical task in the healthcare domain, aiming to automatically generate accurate and detailed diagnostic reports from medical images, thereby alleviating the substantial burden on radiologists. However, a significant challenge arises from the minimal variability among most X-rays of the same anatomical region, compounded by the existence of numerous X-rays of the same region taken at different time points for the same patient. The difficulty lies in capturing these subtle differences to generate precise reports. We propose the Diff-Aware Routing Attention Network (DiffRGenNet), a method designed to generate more granular reports by focusing on these differences. Initially, similar reports are retrieved through image retrieval, and the Feature Diff module is employed to identify discrepancies. Subsequently, the FlexiRoute Aggregation Module (FAM) dynamically orchestrates global and local dependencies, crafting an optimal routing path for each sample to select the most suitable report that describes the changes and connections in differences. Finally, the method leverages the consistency of classification information and the inconsistency of differences to align closely with the ground truth, distancing itself from noise and errors, and enhancing the learning capability for rare disease differences, thereby generating more accurate and refined reports. Experimental results demonstrate that this method outperforms existing approaches on the MIMIC-CXR and IU X-Ray datasets, showing significant improvements across multiple evaluation metrics, which validates its efficacy and potential.


Medical report generation is a critical task in healthcare, aiming to automatically produce accurate diagnostic reports from medical images, thereby alleviating the burden on radiologists. However, due to the high similarity among medical images of the same anatomical region and the substantial variations captured from the same region across different time points for individual patients, capturing these differences poses a significant challenge. We propose a {\bf Diff}erence-aware {\bf R}eport {\bf Gen}eration {\bf Net}work (DiffRGenNet), which retrieves similar reports through image search, identifies differences using the Feature Diff module, and dynamically orchestrates global and local dependencies via the FlexiRoute Aggregation Module to determine the optimal routing path for each sample, selecting the most suitable report to describe the variations and connections. Finally, by leveraging the consistency of classification information and the discrepancy information from the diff module, DiffRGenNet enhances the ability to learn differences in rare diseases, generating more precise reports. Experiments demonstrate that DiffRGenNet outperforms existing methods on the MIMIC-CXR and IU X-Ray datasets, confirming its effectiveness and potential.
% 医学报告生成是医疗领域中的一项关键任务，旨在从医学图像中自动生成准确且详细的诊断报告，减轻放射科医生撰写报告的沉重负担。（然而，当下同一个部位的大部分X-ray差异性较小，并且存在大量患者不同时间段拍摄的相同部位的X-ray）然而，由于同一部位的X-ray图像通常具有较高的相似性，且患者在不同时间拍摄的同一部位图像数量庞大，在大量这种数据下如何捕捉差异化来进行报告生成成为一个难点。我们提出了Diff-Aware Routing Attention Network(DiffRGenNet)，该方法旨在通过捕捉差异性来生成更加细粒度的报告。首先通过图像检索相近的报告，通过Feature Diff模块着重识别差异性，其次，将图像的差异性和联系通过我们设计的FlexiRoute Aggregation Module（FAM）模块来动态地调度全局和局部依赖，为每个样本制定最优路由路径，选择最理想的报告来描述差异性变化和联系。最后，利用分类的一致性信息和差异性的不一致信息来和真值接近，远离噪声和错误信息，并且增强了对罕见病差异性的学习能力，生成更准确精细的报告。实验结果表明，该方法在MIMIC-CXR和IU X-Ray数据集上的表现优于现有方法，在多项评价指标上均取得了显著提升，验证了其有效性和潜力。



\end{abstract}

\begin{keywords}
Report Generation, Multimodal Learning.
\end{keywords}

\section{Introduction}
Radiological reports serve as critical foundations for clinical diagnosis and treatment based on medical imaging. In recent years, the automatic generation of radiological reports using deep learning has gained popularity. However, most existing methods focus on generating reports for the diseases themselves~\cite{chen2020generating,li2024kargen} while neglecting individual variability and nuanced features among different conditions. %For instance, the subtle differences between Example 1 and Example 2 in Figure~\ref{fig:example} make it challenging to visually discern their respective disease states, leading to semantic errors and potentially misleading medical guidance if reports are generated directly from such images. 
Moreover, in clinical practice, there are multiple X-rays from the same patient at different time points, as well as a large number of X-rays capturing the same anatomical region across different patients. Therefore, it is key to capture these fine-grained differences to generate precise and detailed reports.
% 放射学报告是依据医学影像进行临床诊疗的重要依据。近年来，基于深度学习的放射学报告自动生成逐渐流行，然而这些方法大多关注疾病本身的报告生成，忽略了不同疾病之间的个体差异性和细节特征。例如例1之间和例2之间的差异变化较小，很难直观的观察出所属的疾病状态，直接对这些图像进行报告生成会带来语义错误和负面的医学指导。同时，在临床实践中，存在大量患者不同时期的X-ray，且同时期同部位的X-ray也较多。因此，捕捉差异性来生成细粒度的报告是十分必要的。

% 图1，在例子1中，是大部分正常的x—ray，两者报告也未发现差异和异常。在例子2中，左图是左肺nodular opacities，右图是中肺calcified density，两者的细微差距很小，难以辨别和发现。并且例1和例2之间的这些图像差异变化也很微小，不易察觉，但这正是我们需要重点关注的地方。

%大部分放射学报告生成没有更加细粒度的关注到差异变化，大部分同部位放射线照片差异性较小，，在大量这种数据下如何捕捉差异化来进行报告生成成为一个难点。在许多数据集中，某些疾病的样本数量较少，而大部分常见疾病以及没有疾病的情况则占据多数。这种正负样本的不平衡可能导致模型倾向于常见疾病，因此对于某些差异性的变化我们需要额外的关注，这些差异性的变化可能就是我们要找的疾病以及罕见病。因此捕捉差异性来生成更加细粒度的报告是本文关注的重点。
%\begin{figure}[!t]
%\floatconts
%  {fig:example}  {\vspace{-25pt}\caption{\small{Example 1 illustrates the comparison among the majority of normal reports, while Example 2 demonstrates the contrast between abnormal conditions of a patient across different time periods.The red font highlights the key sections in the report. }}\vspace{-20pt}}
  %{\caption{\small{Example 1 illustrates the comparison among the majority of normal reports, while Example 2 demonstrates the contrast between abnormal conditions of a patient across different time periods.}}}
  %{\includegraphics[width=\textwidth]{midl/example3.jpg}}
%\end{figure}

Image difference captioning involves describing the differences between pairs of similar images using natural language. Recent research has explored how to model reliable representations of changes under varying viewpoints~\cite{park2019robust,vo2019composing,huang2021image}. %These studies can be broadly classified into two approaches. The first approach, based on the DUDA model~\cite{park2019robust,vo2019composing}, precomputes pixel differences between images and directly feeds these differences into their models. The second strategy summarizes common attributes between two images based on feature similarity and then removes these attributes to explicitly infer change features~\cite{huang2021image}. 
In contrast, medical imaging does not need to account for viewpoint variations, focusing solely on the differences between images. %Most tasks compute differences through pixel-wise image comparison. We propose a comprehensive evaluation of input features against average embeddings by integrating three distinct similarity measurement methods, providing a multidimensional perspective on the differences. 
Researchers have developed methods to improve the accuracy of medical report generation (MRG), such as leveraging knowledge graphs~\cite{liu2021auto,xiang2024gmod}, integrating large language models (LLMs)~\cite{liu2024bootstrapping,chen2024dia} for prompt generation~\cite{jin2024promptmrg}, and designing auxiliary enhancement modules to improve generation outcomes. Some works frame MRG as a retrieval problem~\cite{endo2021retrieval,tao2024memory}, assisting generation by extracting the top-$k$ features most similar to the input image. We integrate retrieval into our approach. However, the retrieved features may include both relevant and irrelevant information, and the irrelevant part interferes with the representation of image features. %Additionally, the similarity metrics used in retrieval processes are often coarse, making it challenging to filter out the most optimal reports, which might limit the overall generation efficacy. 

% In recent years,researchers have developed methods to improve the accuracy and comprehensiveness of medical reports. Mainstream approaches predominantly employ encoder-decoder architectures\cite{ji2021improving,mao6632deep}. Early studies focused on leveraging Convolutional Neural Networks (CNNs)\cite{mao6632deep,anderson2018bottom} to extract visual features from medical images. With the advent of the Transformer model, numerous works have incorporated various attention mechanisms to enhance generation performance\cite{velivckovic2017graph}. Recent research has further emphasized knowledge enhancement techniques, such as leveraging knowledge graphs\cite{liu2021auto,xiang2024gmod}, integrating large language models (LLMs)\cite{liu2024bootstrapping,chen2024dia} for prompt generation\cite{jin2024promptmrg}, designing auxiliary enhancement modules to improve generation outcomes. On the other hand, some works frame the medical report generation task as an image-text retrieval problem\cite{endo2021retrieval,tao2024memory},  assisting generation by extracting top-k features most similar to the input image. However, these features may include both relevant and irrelevant information, and the introduction of noise can interfere with the representation of image features. Additionally, the similarity metrics used in retrieval processes are often coarse, making it challenging to filter out the most optimal reports, which limits the overall generation efficacy to some extent.
% 近年来，自动报告生成（Medical Report Generation, MRG）因其在减轻放射科医生工作负担方面的潜力而受到广泛关注，相关研究层出不穷。主流方法主要采用编码器-解码器架构。早期研究侧重于利用卷积神经网络（CNN）提取图像的视觉特征。随着Transformer模型的提出，许多研究引入了多种注意力机制以提升生成性能。
% 近期研究进一步聚焦于知识增强技术，例如利用知识图谱、结合大语言模型（LLM）生成提示（Prompt）、设计辅助增强模块以及提取更细粒度的特征以提升生成效果。另一方面，部分工作将医疗报告生成任务视为图像-文本检索问题，通过提取与图像最接近的top-k特征来辅助生成，但这些特征可能包含相关或不相关信息，噪声的引入可能会干扰图像特征的表达。同时，检索过程中使用的相似性度量方法较为粗糙，往往难以筛选出最理想的报告，这在一定程度上限制了生成效果。


%Based on the aforementioned related work, we have gained an updated understanding of report generation. A fundamental insight is the necessity of learning differential features across medical images to better capture fine-grained characteristics. For instance, subtle pathological changes in a specific anatomical region across different time points for the same patient represent critical focal points for report generation. 

Despite some progress, there is still limited research focused on extracting {\it differential changes between radiological images and reports}. Furthermore, the inherent imbalance in disease distribution exacerbates the challenge of capturing differential variations: rare diseases are underrepresented in training data, which hinders the model's ability to reliably identify characteristic changes in these conditions. Existing models,
%~\cite{ji2021improving}
% mao6632deep
predominantly trained on positive samples, exhibit a bias toward common diseases and fail to effectively discern subtle variations in rare diseases across different time points or patients. Additionally, most current models
%~\cite{xiang2024gmod}
%velivckovic2017graph,
rely on attention mechanisms within Transformer architectures, which struggle to dynamically balance global and local dependencies. As a result, they often fail to capture global structures and local details simultaneously, which impedes the selection of optimal reports to describe differential changes and correlations in retrieval.


% 尽管取得了一些进展，但旨在提取放射学图像与报告之间差异性变化的研究不是很多。首先，一个直观的经验是学习不同医学影像之间的差异性特征，使其能更好的捕捉更加细粒度的特征。例如，同一患者不同时期的影像，在某一部位发生了细微的病变，这种细微的差异性病变即报告生成的主要关注点。此外，疾病分布的不平衡进一步凸显了差异性变化的捕捉难度，因为罕见疾病在训练数据中的代表性不足，导致模型难以可靠地识别这些疾病的特征变化。现有模型如果仅基于正样本学习，会更加偏向于常见疾病，无法有效捕捉罕见病在不同时期或不同患者中的细微变化。再者，现有模型大部分以transformer架构的注意力机制来捕获特征，很难动态地调度全局和局部依赖，往往难以同时捕捉到全局结构和局部细节，导致在检索任务中难以选择最理想的报告来描述这些差异性变化和联系。
To fill this gap, we design a novel network, {\bf Diff}erence-aware {\bf R}eport {\bf Gen}eration {\bf Net}work ({\bf DiffRGenNet}), to leverage both differential and similar information across reports for generating more accurate and reliable medical reports through finer-grained global and local feature extraction. Specifically, building upon an encoder-decoder framework, DiffRGenNet retrieves the $K$ reports most similar to the input image and employs the FlexiRoute Aggregation Module (FAM) to dynamically select the optimal report for describing differential changes and correlations. FAM captures both global and local features, distinguishing between similar features (extracted via a classification branch to identify disease information) and differential features (extracted via a dedicated diff module to highlight variations between images). By contrasting positive and negative samples, the model aligns closely with the ground truth while minimizing noise, thereby enhancing its ability to capture subtle variations in rare diseases. Extensive experiments on two MRG benchmarks demonstrate the effectiveness of our approach, achieving state-of-the-art results.
%in clinical efficacy (CE) performance on both datasets. 
Our contributions are summarized as follows:
% 为了解决这些问题，我们设计了Diff-Aware Routing Attention Network(DiffRGenNet)架构，旨在充分利用报告间的差异信息和相似信息，通过更细粒度的全局和局部细粒度特征捕获，生成更准确和可靠的报告。具体来说，我们基于编码器-解码器架构，检索K个与图像相近的报告，通过我们设计的FAM模块，选择最理想的报告来描述这些差异性变化和联系，动态的在全局和局部分别捕获与图像相似和diff的特征，相似特征通过分类分支获取到疾病信息，diff特征通过diff模块获取了图像与diff变化的信息，从而抓住重要信息，通过正负样本的对比学习来和真值接近 远离噪声，让模型提高对罕见病差异性的捕获能力。我们在两个MRG基准上的实验表明了所提出方法的有效性，该方法在两个数据集上都获得了SOTA CE性能。我们将贡献总结如下。
% \footnote{Random footnote are discouraged}:
%\begin{itemize}
%\item 
(i) We propose a novel network framework, DiffRGenNet, which utilizes contrastive learning with feature-differential negative samples to effectively capture nuanced variations and generate more fine-grained medical reports.
% 我们提出了一种新的网络方法DaRNet，基于差异性的负样本对比学习，有效的捕捉了差异性来生成更加细粒度的报告。
%基于检索的方法，通过设计的动态路由模块选择最理想的报告，动态捕获全局和局部特征。并且通过diff模块着重关注疾病变化的部分，通过对比学习来拉近相似报告，拉远有变化部分的报告，会关注到罕见病的情况。
%\item 
(ii) We design the FlexiRoute Aggregation Module (FAM) to adaptively capture the most relevant global and local features for describing differential variations and their correlations. Further, we introduce a dedicated feature difference module that focuses on disease-related changes.
% 设计了FAM动态路由模块，用于动态捕获最理想的全局和局部特征来描述这些差异性变化和联系。以及设计了diff变化模块，用来着重关注疾病变化部分。
%\item 
(iii) We demonstrate the superiority of DiffRGenNet through evaluations on two widely recognized benchmarks, achieving state-of-the-art (SOTA) performance on both datasets.
%我们通过两个流行的基准测试证明了DaRNet的优越性，它在两个数据集上都获得了SOTA CE性能。
%\end{itemize}

% \section{Related work}

% \subsection{Medical Report Generation}


% 由于医学报告生成（Medical Report Generation, MRG）与图像字幕生成任务具有相似性，大多数现有的MRG模型借鉴了图像字幕生成领域的编码器-解码器架构（例如，Xu等人，2015；Lu等人，2017；Ji等人，2021）。这些模型通常采用卷积神经网络（CNN）提取视觉特征[1,22,30]，并结合短期记忆网络（LSTM）[21,33,30]或循环神经网络（RNN）[1,22]生成报告或提取特定区域的特征[39]。然而，与图像字幕生成相比，MRG任务面临更大的挑战：医学报告通常篇幅较长，且医学图像中的临床异常比自然图像中的物体更难识别。 为应对这些挑战，研究人员提出了多种改进方法。例如，Chen等人（2020）和Yang等人（2023）引入了额外的存储模块，用于记录先前的相似模式，从而在解码过程中提供信息支持，显著提升了生成性能。随着Transformer技术的普及[27]，许多研究开始利用各种注意力机制优化模型性能[7,20,39]。此外，部分工作通过引入知识图谱[21,28,43]或存储块[3,4,40,41]来编码或学习领域知识及模式，进一步增强了模型对医学特征的理解和表达能力。
% 最近的研究将提示视为提高特定任务性能的重要工具。例如，Qin等人（2023）开发了一种自动生成医学提示的方法，以增强预训练视觉语言模型在医学对象检测方面的知识传递能力。Chenhao等人（2022）提出了一种利用自适应疾病分类生成提示的方法，以指导报告生成过程，确保模型能生成诊断正确的报告。

% 在过去的两年里，大型语言模型（LLMs）[18]凭借其广泛的知识库，在生成连贯、上下文相关以及更接近人类表达的反应方面展示了显著的能力。目前的方法主要依赖于区域图像特征的视觉提示，但这种方法可能难以捕捉到详细的疾病相关信息。尽管一些研究尝试训练疾病分类器并将其输出作为附加的文本提示[17]，但所提供的信息仍然较为稀疏，无法充分为疾病线索提供支持。

% \subsection{Image difference Caption}
% 图像差分字幕是用自然语言描述相似图像对之间差异的一项任务。在这项任务中，准确、全面地定位和描述相似图像的变化非常重要。
% 起初工作描述了两个对齐良好的图像之间的差异[6]，[20]。然而，由于视点变化，两幅图像在现实环境中通常是不对齐的[21]。
% 为此，最新研究探索了如何在视点变化下对变化的可靠表示进行建模。这些作品基本上可以分为两个维度。一种是基于DUDA模型提出的，第一步预先计算图像的像素差异，然后直接将这些差异暴露给它们的模型。例如，SRDRL AVS[8]首先测量了减差与图像对之间的相关性，以判断是否发生了变化。然后，它引入了词性信息来动态使用视觉信息。第二种策略是首先基于特征相似性总结两幅图像之间的共同属性，然后将其删除以明确推断变化的特征。先锋方法VAM[9]遵循了这一范式。此外，IFDC[24]首先用属性和位置特征增强对象特征，然后计算变化的特征。

% 然而大部分任务对于差异部分是通过计算图像的像素差得到的，我们通过集成三种不同的相似度度量方法来评估输入特征与平均嵌入之间的差异，这包括L2距离、余弦相似度和点积相似度。每种度量方法提供了独特的视角，从而全面地捕捉到向量之间的关系。
% 图像差异字幕生成部分任务在解决视点变化引起的问题，在医学影像中对于同样的X-ray不会存在视角的变化，效果也会显著提高，

%图像差分字幕是用自然语言描述相似图像对之间差异的一项任务。最新研究探索如何在视点变化下对变化的可靠表示进行建模。这些作品基本上可以分为两个维度。一种是基于DUDA模型提出的，第一步预先计算图像的像素差异，然后直接将这些差异暴露给它们的模型。第二种策略是首先基于特征相似性总结两幅图像之间的共同属性，然后将其删除以明确推断变化的特征。幸运的是，医学影像不必关注视点的变化，只需要关注图像之间的差异性即可。然而大部分任务对于差异部分是通过计算图像的像素差得到的，我们通过集成三种不同的相似度度量方法从不同角度全面的来评估输入特征与平均嵌入之间的差异。
\section{Method}
%\subsection{Problem Formulation}

Given a radiological image $I$, the model is required to generate a descriptive radiological report $\tilde{R} = \left\{ r_1, r_2, \ldots, r_{N_R} \right\}$, where $r_i$ represents a token in the report and $N_R$ is the length of the report. The autoregressive generation process can be formulated as $P(\tilde{R}|I) = \prod_{t=1}^{N_R} p(r_t|r_1, r_2, \ldots, r_{t-1}, I)$. DiffRGenNet estimates $P(\tilde{R}|I)$ via a network that comprises three main modules, as shown in Figure~\ref{fig:method}: (i) the Feature Difference Module, discussed in Section~\ref{sec:dif}; (ii) the FlexiRoute Aggregation Module, detailed in Section~\ref{sec:fam}; and (iii) the Neg-Pos Matching, outlined in Section~\ref{sec:match}.
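As a brief illustration of how such a factorization is typically used for training, taking the negative logarithm turns the product into a sum of token-level terms. The expression below is the standard language-modeling (cross-entropy) objective under teacher forcing, with $\theta$ as an assumed symbol for the network parameters; it is a sketch of the usual formulation rather than a statement of the exact loss used in DiffRGenNet:
\begin{equation*}
-\log P_{\theta}(\tilde{R}|I) = -\sum_{t=1}^{N_R} \log p_{\theta}(r_t|r_1, \ldots, r_{t-1}, I).
\end{equation*}
For example, a hypothetical three-token report whose tokens receive conditional probabilities $0.5$, $0.8$, and $0.9$ has probability $0.36$ and a negative log-likelihood of $-\log 0.36 \approx 1.02$ (natural logarithm).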
% The model is trained to minimize the language modeling loss, which serves as the primary loss function:
% \begin{equation}
%     {\cal L}_{CE}(\theta)=-\sum_{t=1}^{N_R}\log(p_{\theta}(r_{t+1}|r_{1:t})).
% \end{equation}

% As shown in Figure 2, DiffRGenNet primarily comprises three modules: (1) the FlexiRoute Aggregation Module, detailed in Section 2.3; (2) the Feature Difference Module, discussed in Section 2.2; and (3) the Neg-Pos Matching, outlined in Section 2.4.
% 如图2，DaRNet主要包含3个模块，（1）FlexiRoute Aggregation Moudle，在2.3节介绍。（2）Feature Difference Module，将在2,2节介绍；(3)  Neg-Pos Matching,将在2.4节介绍

% 根据图2可以看出，DaRNet遵循主流编解码器架构 其中编码器提取图像I的视觉特征，解码器生成基于diff知识和疾病知识的报告R。
% 给定放射学图像I，需要该模型生成描述性放射学报告$\stackrel{\cdot}{R} \;=\;\left\{ r_1, r_2, \ldots, r_{N_R} \right\}$，其中ri是报告的标记NR是报告长度。递归生成过程可以公式化为$P(\overset{\sim}{R}|I) = \prod_{t=1}^T p(\overset{\sim}{r_{t+1}}|r_1, r_2, \ldots, r_t, I)$，模型经过训练以最小化，语言建模损失被用作主要损失：
% \begin{equation}
%     {\cal L}_{CE}(\theta)=-\sum_{t=1}^{N_R}\log(p_{\theta}(r_{t+1}|r_{1:t})).
% \end{equation}

% 基于以上，我们提出了一种名为DaRNet的新网络（如图1所示），以动态捕获图像的全局和局部的差异性和联系，生成更加细粒度的报告。
\raggedbottom 
\begin{figure}[!t]
\floatconts
  {fig:method}
  {\vspace{-25pt}\caption{\small{The architecture of DiffRGenNet. It integrates a feature difference module, a FlexiRoute aggregation module (FAM), and contrastive learning to generate more fine-grained and precise medical reports.}}\vspace{-20pt}}
  %{\caption{The architecture of DiffRGenNet integrates a feature difference module, a FAM dynamic routing module, and contrastive learning to generate more fine-grained and precise medical reports.}}
  {\includegraphics[width=0.9\linewidth]{RARD [自动保存的] (5).pdf}}
\end{figure}



\subsection{Feature Difference Module}
\label{sec:dif}
In MRG, each report meticulously describes the affected regions and associated symptoms of a patient, derived from identifying and characterizing abnormal areas in medical images. The differential metric mechanism~\cite{tu2023neighborhood,tu2023adaptive} enhances the model's sensitivity to input features, enabling it to more effectively capture critical clinical information within the images, thereby improving the accuracy and reliability of the generated reports.
% 在医学报告生成任务中，每份报告详细描述了患者病变的部位及相关症状，这些症状是通过识别和描述影像中的异常区域得出的。差异性度量机制增强了模型对输入特征的敏感性，使其能够更有效地捕捉图像中的关键临床信息，从而提升报告的准确性和可靠性。

To effectively quantify these differences, this module compares image and text embeddings using three distinct metrics: L2 distance, cosine similarity, and dot product similarity. These metrics facilitate the quantification of both similarities and differences between input features and reference embeddings from multiple perspectives, enhancing the robustness of the model.
% 在多模态学习中，特别是在图像与文本的对比任务中，捕捉输入特征与平均嵌入之间的差异是至关重要的。为了有效地捕捉这种差异，本模块通过三种不同的度量方法对图像和文本的嵌入进行比较：L2距离、余弦相似度和点积相似度。这三种度量方法有助于从多个角度量化输入特征与参考嵌入之间的相似性和差异性。
The L2 distance, also known as the Euclidean distance, represents the straight-line distance between two vectors. In this task, the L2 distance is used to measure the difference between the input feature \( Z_k \) and the average embedding \( F \):
$dif_{L2} = \|Z_k - F\|_2$.
Cosine similarity measures the similarity in direction between two vectors, regardless of their magnitudes. Here, we use cosine similarity to quantify the similarity between the input feature and the average embedding:
$dif_{cos} = \frac{Z_k F}{\|Z_k\|_2 \|F\|_2} \in \mathbb{R}^{N \times 1}$.
Dot product similarity measures the inner product of two vectors, reflecting the degree of overlap in the same direction:
$dif_{dot}=Z_k F$.
These three quantities are concatenated and passed through an MLP to obtain:
% L2距离，又称为欧几里得距离，表示两个向量之间的直线距离。在此任务中，L2距离用于衡量输入特征Z_k与平均嵌入F之间的差异$dif_{f_{12}} = \|Z_k - F\|_2$。余弦相似度用于衡量两个向量在方向上的相似度，而不考虑其大小。在这里，我们使用余弦相似度来量化输入特征和平均嵌入之间的相似性$dif_{f_{cos}} = \frac{z_k F}{\|z_k\|_2 \|F\|_2} \in \mathbb{R}^{N \times 1}$。点积相似度度量了两个向量的内积值，它反映了向量在相同方向上的重叠程度$diff_{dot}=Z_kF$。拼接这些向量得到：
\begin{equation}
dif = \text{MLP}(\text{concat}(dif_{L2}, dif_{cos}, dif_{dot})) \in \mathbb{R}^{N \times d_h}.
\end{equation}
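To make the three measures concrete, consider a small worked example. It assumes a single token feature $z = (1, 2, 2)$ (one row of $Z_k$) and an average embedding $F = (2, 1, 2)$, both hypothetical values chosen only for illustration, with the product of the two read as their inner product:
\begin{equation*}
dif_{L2} = \|(-1, 1, 0)\|_2 = \sqrt{2} \approx 1.41, \qquad
dif_{cos} = \frac{8}{3 \cdot 3} \approx 0.89, \qquad
dif_{dot} = 2 + 2 + 4 = 8.
\end{equation*}
Doubling the feature to $z' = (2, 4, 4)$ leaves $dif_{cos}$ at $8/9$ but changes $dif_{dot}$ to $16$ and $dif_{L2}$ to $\sqrt{13} \approx 3.61$. This suggests why the three measures are complementary: cosine similarity reflects direction only, whereas the other two also respond to magnitude.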

\subsection{FlexiRoute Aggregation Module}
\label{sec:fam}
Transformer, renowned for its exceptional capability in modeling global dependencies, has been widely adopted in medical report generation. However, the challenge of dynamically balancing global and local dependencies within Transformer architectures remains unresolved. To address this, we propose the FlexiRoute Aggregation Module (FAM), shown in Figure~\ref{fig:method}, which introduces a routing mechanism with varying attention spans at each layer of the vision Transformer. This module dynamically computes attention weights based on the output of the previous step, enabling the generation of a routing path tailored to each sample. This routing supports the retrieval process by facilitating the selection of the most suitable report.
% Transformer凭借其卓越的全局依赖建模能力，已被广泛应用于报告生成任务中，以聚焦关键区域。然而，如何更有效地动态调度Transformer中的全局与局部依赖关系，仍是一个亟待解决的问题。为此，我们设计了动态路由模块（FlexiRoute Aggregation Module, FAM），在每一层视觉Transformer中引入了具有不同注意力广度的路由机制。该模块能够根据前一步的输出动态计算相应的注意力权重，从而为每个样本生成最优的路由路径，为检索阶段筛选出最理想的报告提供了有力支持。
% \begin{figure}[!t]
% \floatconts
%   {fig:FAM}
%   {\vspace{-25pt}\caption{\small{The layer of FAM}}\vspace{-20pt}}%{\caption{The layer of FAM}}
%   {\includegraphics[width=0.2\linewidth]{midl/1737655821088.jpg}}
% \end{figure}

In the FlexiRoute Aggregation Module, feature embeddings are processed through multiple Dynamic Routing Attention (DRA) layers, computed as follows:
% 我们提出了动态路由Transformer，它通过对不同特征的层次共注意力进行路由，捕捉特征的一致性和不一致性，能够适应不同的医学图像输入。在动态路由变换器中，我们将特征嵌入输入到多个DRA层中，计算公式如下：
\begin{equation}
Z_k=DRA(Z_{k-1},F),k\in[1,K],
\end{equation}
where \( Z_k \) represents the output of the \( k \)-th DRA layer, \( Z_0 = Z \) denotes the input to the first layer, \( K \) is the number of DRA layers, and the output \( Z_K \) of the final DRA layer constitutes the final routed features.
% 其中，$Z_k$是第k层DRA的输出，$Z_0=Z$是第一层的输入，K是DRA层的最大索引，最后一层DRA的输出$Z_k$是最终的路由特征。
In contrast to prior dynamic methods such as TRAR~\cite{zhou2021trar}, which perform routing on a single feature's attention grid, our DRA layers route hierarchical co-attention across both image and text features, conditioned on the specific input. Each DRA layer consists of a Multi-Head Co-Attention Routing (MHCAR) module, a Multi-Head Self-Attention (MHA) module, and a Feed-Forward Network (FFN), with each module followed by a residual connection and a layer normalization (LN). The $k$-th DRA layer can be expressed as:
% 与先前的动态方法TRAR不同，后者在单一特征的注意力网格上执行路由，我们的DRA层在图像和文本的层次共注意力上进行路由，且根据不同的输入进行条件化。我们的DRA层由一个多头共注意力路由（MHCAR）模块、一个多头自注意力（MHA）模块和一个前馈网络（FFN）组成，其中每个模块后面跟着一个残差连接和一个归一化层（LN）。第k层DRA可以表示为：
\begin{equation}
\begin{aligned}
Z_{k-1}^r &= \mathrm{LN}(\mathsf{MHCAR}_k(Z_{k-1},F) + Z_{k-1}),\\
Z_{k-1}^a &= \mathrm{LN}(\mathsf{MHA}_k(Z_{k-1}^r) + Z_{k-1}^r), \\
Z_k &= \mathrm{LN}(\mathsf{FFN}_k(Z_{k-1}^a) + Z_{k-1}^a),
\end{aligned}
\end{equation}
where \( k \in [1, K] \) denotes the index of the DRA layer, \( Z_k \in \mathbb{R}^{n \times d_t} \) represents the output of the \( k \)-th DRA layer, and \( Z_{k-1}^r \) and \( Z_{k-1}^a \) are the outputs of the MHCAR module and the MHA module, respectively.
% 其中，$k\in[1,K]$是DRA层的索引，$Z_k\in{R^{n×d_t}}$是第k层DRA的输出，$Z_{k-1}^r$和$Z_{k-1}^a$分别是MHCAR模块和MHA模块的输出。
In the \( k \)-th DRA layer, the MHCAR module performs an \( h \)-head attention function, computing the hidden dimension \( d_h \) (where \( d_h = d_t / h \)) in parallel for each head. The results from these heads are concatenated and then projected to produce the final output of the MHCAR module. This process can be formulated as:
% 在第k层DRA中，MHCAR执行h头的注意力函数，并行计算每个头的隐藏维度$d_h$（其中$d_h  = d_t/h$），这些头的结果被连接后投影，从而得到MHCAR的最终值，计算公式为：
\begin{equation}
\text{MHCAR}_k (Z_{k-1}, F) = \text{concat}([head_i^k]_{i=1}^h) O_T^k,
\end{equation}
where \(\text{concat}(\cdot)\) denotes the concatenation operation, \( O_T^k \in \mathbb{R}^{d_t \times d_t} \) is the projection matrix, and each head \( \text{head}_{i}^{k} \in \mathbb{R}^{n \times d_{h}} \) is computed by the Co-Attention Routing (CAR) function, formulated as:
% 其中，\text{concat}(·) 是连接操作， $O_T^K\in{R^{d_t×d_t}}$是投影矩阵，每个头$head_{i}^{k} \in \mathbb{R}^{n \times d_{h}}$由共注意力路由（CAR）函数计算，公式为
\begin{equation}
\begin{aligned}
head_i^k = \text{CAR}_i^k (Z_{k-1}, F) &= \sum_{j=0}^{p_k-1} \alpha_j^k \, CA_{i,j}^k (Q_{i,j,k}, K_{i,j,k}, V_{i,j,k}, A^j)\\
&= \sum_{j=0}^{p_k-1} \alpha_j^k \, \sigma \left( \frac{Q_{i,j,k} K_{i,j,k}^{\top}}{\sqrt{d_h}} \otimes A^j \right) V_{i,j,k},
\end{aligned}
\end{equation}
where \(\sigma(\cdot)\) is the softmax function, \(\alpha_j^k\) is the routing probability weight for the \(j\)-th co-attention function, \(A^j\) is a co-attention mask between the two features, and the scaled product of \(Q_{i,j,k}\) and \(K_{i,j,k}\) gives the attention matrix between the two features for \(\text{head}_i^k\). Here, \(Q_{i,j,k} = Z_{k-1} W_{i,j,k}^Q\), \(K_{i,j,k} = F W_{i,j,k}^K\), and \(V_{i,j,k} = F W_{i,j,k}^V\), where \(W_{i,j,k}^Q \in \mathbb{R}^{d_f \times d_h}\), \(W_{i,j,k}^K \in \mathbb{R}^{d_f \times d_h}\), and \(W_{i,j,k}^V \in \mathbb{R}^{d_f \times d_h}\) are parameter matrices, and \(\otimes\) denotes element-wise matrix multiplication.
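As a small illustration of the routing sum, suppose (hypothetically) that layer $k$ offers $p_k = 2$ masks, $A^0$ and $A^1$, with different attention spans, and that the router outputs $\alpha_0^k = 0.8$ and $\alpha_1^k = 0.2$. Writing $CA_{i,j}^k(A^j)$ as shorthand for the $j$-th co-attention output, the head then reduces to a weighted combination of the two views:
\begin{equation*}
head_i^k = 0.8\, CA_{i,0}^k(A^0) + 0.2\, CA_{i,1}^k(A^1).
\end{equation*}
When the routing weights approach a one-hot vector, the head effectively selects a single attention span for that sample.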
% 其中，σ(·)表示softmax函数，$α_j^k$是第j个CA函数的路由概率权重，$A^j$是两个特征之间的一种共注意力掩码，$Q_{i,j,k} K_{i,j,k}$是第$\text{head}_i^k$中两个特征之间的注意力矩阵。在这里$Q_{i,j,k} = Z_{k-1} W_{i,j,k}^Q,  K_{i,j,k} = FW_{i,j,k}^K,  V_{i,j,k} = FW_{i,j,k}^V,$ \text{其中} \,$ W_{i,j,k}^Q \in \mathbb{R}^{d_f \times d_h}, W_{i,j,k}^K \in \mathbb{R}^{d_f \times d_h}$, $W_{i,j,k}^V \in \mathbb{R}^{d_f \times d_h} \, \text{是参数矩阵}, \, \otimes \, \text{表示元素级的矩阵乘积}.$ \,

We describe the construction of the co-attention mask matrix \( A^j \), which restricts the relevant regions that image features can attend to within the co-attention function. Using an \( s \)-order sliding window, a patch of size \( (2s+1) \times (2s+1) \) traverses each block of the image, generating a mask vector \( v_l^s \in \mathbb{R}^m \) (where \( l \in [1, m] \)). The matrix \( A^s \) is constructed by cyclically stacking the vector \( v_l^s \) \( n \) times (where \( n \) is the token length):

% 我们描述如何构建共注意力掩码矩阵 $A^j$。$A^j$限制了图像特征在共注意力函数中能够看到的相关区域。一个s阶滑动窗口，大小为 (2s+1)×(2s+1)的小块，遍历图像的每个块，得到掩码向量 $v_l^s\in{R^m} （l\in[1,m]）$。我们通过将向量$v_l^s$循环叠加n次（n是token的长度）来构造$A^s$：
\begin{equation}
A^s  = [v_l^s, v_l^s, \ldots, v_l^s] \in \mathbb{R}^{n \times m}.
\end{equation}

Specifically, \( A^0 \) is an empty mask matrix, i.e., a matrix filled with ones, allowing words or the global token [CLS] to attend to the entire image. To progressively model the consistency between different feature pairs, we design a hierarchical co-attention mechanism by incrementally increasing the number of DRA layers, thereby diversifying the types of co-attention masks. In the \( k \)-th DRA layer, the set of co-attention mask matrices that the router can select from is defined as
$G_k  = [A^0,A^1,\ldots,A^{p_k-1}]$, where \( p_k \) denotes the number of mask matrices in the \( k \)-th DRA layer.
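As an illustration of a single mask vector, consider a $3 \times 3$ grid of image patches ($m = 9$) indexed row by row, and a first-order window ($s = 1$, size $3 \times 3$). Under the assumption, not stated explicitly above, that $v_l^s$ is binary with ones for patches inside the window centered on patch $l$ and zeros elsewhere, the center patch ($l = 5$) and a corner patch ($l = 1$) give
\begin{equation*}
v_5^1 = (1, 1, 1, 1, 1, 1, 1, 1, 1), \qquad v_1^1 = (1, 1, 0, 1, 1, 0, 0, 0, 0).
\end{equation*}
Changing the order $s$ changes the window size, so the masks available to the router cover different spatial extents, alongside the unrestricted view provided by $A^0$.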
% 具体地，$A^0$是一个空的掩码矩阵，即一个全为1的矩阵，这使得词语或全局token [CLS] 能够看到整个图像。为了逐渐建模不同特征对之间的一致性，我们设计了层次共注意力机制，通过逐步增加DAR层来使共注意力掩码的种类逐渐多样化，在第k层DAR中，路由器可以路由的共注意力掩码矩阵组为：$G_k  = [A^0,A^1,...,A^(p_k-1)]$。其中， $p_k$是第k层DAR中的掩码矩阵数量。
The routing probabilities \( \alpha_k = [\alpha_0^k, \alpha_1^k, \dots, \alpha_{p_k - 1}^k] \) for the \( k \)-th DRA layer are produced by the router, conditioned on the input:
% 第k层DAR的路由概率$α_k=[α_k^0,α_k^1,…,α_k^{p_{k-1}} ]$ 可以通过路由器根据输入条件获得，其计算公式为：
\begin{equation}
{\alpha}_k=\sigma_g (\text{MLP} (\text{APool} (F)))~\in \mathbb{R}^{p_k},
\end{equation}
where \( \sigma_g(\cdot) \) is the Gumbel-Softmax function with temperature \( t \), \( \text{APool}(\cdot) \) denotes the 1D adaptive average pooling over all patch embeddings in the image, and MLP is a two-layer perceptron with hidden dimension \( d_m \).%, and \( p_k \) represents the number of co-attention mask matrices in the \( k \)-th DAR layer.
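For reference, the Gumbel-Softmax function is commonly defined as follows. This is the standard formulation, assumed here rather than taken from the text, with $\ell = \text{MLP}(\text{APool}(F)) \in \mathbb{R}^{p_k}$ denoting the router logits and $g_j$ the Gumbel noise (both symbols are introduced only for this sketch):
\begin{equation*}
\alpha_j^k = \frac{\exp\left((\ell_j + g_j)/t\right)}{\sum_{q=0}^{p_k-1} \exp\left((\ell_q + g_q)/t\right)}, \qquad g_j = -\log(-\log u_j), \quad u_j \sim \mathrm{Uniform}(0, 1).
\end{equation*}
As $t \to 0$ the weights approach a one-hot vector, so each sample is routed to a single mask, whereas a larger $t$ yields a softer mixture over the available masks.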

% 其中，$\sigma_g$ (·)是Gumble Softmax函数，温度为  t ，$\text{APool}$ (·)是对图像中所有块嵌入的1D自适应平均池化，MLP是一个具有隐藏维度$d_m$的两层多层感知机，$p_k$也是第k层DAR中共注意力掩码矩阵的数量。

\subsection{Neg-Pos Matching}
\label{sec:match}
Most existing studies reinforce high-relevance segments by associating cross-modal shared semantics, while weakening or even ignoring the impact of mismatched segments. Our work goes beyond attending only to matched segments: we employ supervised contrastive learning (SCL) to simultaneously align both similar and dissimilar segments, thereby more comprehensively capturing cross-modal semantic relationships.
% 与大多数现有研究不同，我们的工作突破了仅关注增强匹配片段注意力的局限，这些研究通常通过关联跨模态共享语义来强化高相关性片段，而削弱甚至忽略不匹配片段的影响。传统方法主要依赖匹配片段（即高相关性的单词或区域）来衡量相似性，同时低估或不考虑不匹配片段（即低相关性的单词或区域）的作用。相比之下，我们通过监督对比学习，同时匹配相似和不相似的片段，从而更全面地捕捉跨模态语义关联，提升模型的鲁棒性和泛化能力。

\textbf{SCL Loss}. SCL learns useful representations by maximizing the similarity between positive samples while minimizing the similarity between negative samples. The model compares an anchor sample with positive and negative samples: given an anchor, the goal is to make it more similar to its positive samples and less similar to its negative samples. This is achieved by computing similarity scores between samples and applying a variant of the contrastive loss function. In this task, we partition the representations in each batch into multiple subsets according to whether they share the same sample label. For each subset, the representations within the subset serve as positive samples, while those from the other subsets act as negative samples.
\begin{equation}
\mathcal{L}_{SCL}=-\frac{1}{N}\sum_{i=1}^N\log\left(\frac{e^{f(x_i,x_i^+)}}{\sum_{j=1}^Ke^{f\big(x_i,x_j^-\big)}}\right),
\end{equation}
where \( N \) is the number of samples in the batch, \( K \) is the number of negative samples, \( x_i \) is the anchor sample, \( x_i^+ \) is its positive sample, \( x_j^- \) is the \( j \)-th negative sample, and \( f(x, y) \) is the similarity score between samples \( x \) and \( y \) after projection into the latent space.
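
To make the two effects of this loss explicit, the logarithm in \( \mathcal{L}_{SCL} \) can be expanded; this is only an algebraic rewriting of the same expression and adds no new assumption:
\begin{equation*}
\mathcal{L}_{SCL}=\frac{1}{N}\sum_{i=1}^N\left[\log\sum_{j=1}^K e^{f(x_i,x_j^-)}-f(x_i,x_i^+)\right].
\end{equation*}
The first term is a log-sum-exp over the negative samples, which is bounded below by \( \max_j f(x_i,x_j^-) \) and whose gradient is largest for the most similar negative, while the second term rewards a high score for the positive pair.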

\textbf{Disease Classification Loss}. Inspired by PromptMRG~\cite{jin2024promptmrg}, which adaptively adjusts the learning objective of each disease according to its learning state, we adopt the logit-adjusted loss~\cite{menon2020long} to balance learning across diseases. This loss encourages the model to focus more on rare diseases by reducing their logits during optimization. For a given disease \( D \), the logit-adjusted loss for the positive label \( P \) is formulated as:
\begin{equation}
%\ell_D
\mathcal{L}_{\rm SDL}(y=P,f(\boldsymbol{x}^E))=-\log\frac{e^{f_y(\boldsymbol{x}^E)+\log\pi_D}}{\sum_{y'\neq P}e^{f_{y'}(\boldsymbol{x}^E)} + e^{f_y(\boldsymbol{x}^E)+\log\pi_D}}.
\end{equation}
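As a short derivation, write \( f_P \) for \( f_y \) with \( y=P \) (a shorthand used only here). Dividing the numerator and the denominator by \( e^{f_P(\boldsymbol{x}^E)+\log\pi_D} \) gives the equivalent form
\begin{equation*}
\mathcal{L}_{\rm SDL}=\log\Bigl(1+\sum_{y'\neq P}e^{f_{y'}(\boldsymbol{x}^E)-f_P(\boldsymbol{x}^E)-\log\pi_D}\Bigr).
\end{equation*}
Whenever \( \pi_D<1 \), the offset \( -\log\pi_D \) is positive, so the loss becomes small only when \( f_P \) exceeds the competing logits by a margin that grows as \( \pi_D \) shrinks. Under the assumption that \( \pi_D \) is smaller for rarer diseases, as with the class prior in logit adjustment~\cite{menon2020long}, this is the mechanism by which rare diseases receive a stronger training signal.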

\textbf{Total Training Loss.}
The language modeling loss serves as the primary loss:
$\mathcal{L}_{\mathrm{LM}}=-\sum_{t=1}^T\log p(r_t \mid r_1,\dots,r_{t-1},X,d_1,\dots,d_L)$.
The total training loss of our model is
$\mathcal{L} = \mathcal{L}_{\rm LM} + \lambda \mathcal{L}_{\rm SDL} + \gamma \mathcal{L}_{\rm SCL}$,
where $\lambda$ and $\gamma$ weight the disease classification loss and the SCL loss, respectively.
\section{Experiments}
\subsection{Datasets and Metrics}
\textbf{Datasets:} We validate the proposed method using two public datasets: MIMIC-CXR and IU X-Ray.
\textbf{MIMIC-CXR}
\cite{johnson2019mimic} is currently the largest dataset containing chest X-ray images paired with corresponding textual reports. This dataset includes 377,110 chest X-ray images and 227,835 free-text radiology reports. Following the official split and the preprocessing steps, %outlined by Chen et al. (2020), 
the resulting training, validation, and test sets contain 270,790, 2,130, and 3,858 samples, respectively. 
%\item
\textbf{IU X-Ray}
\cite{demner2016preparing} is another commonly used public dataset. %for medical report generation. 
This dataset contains 7,470 X-ray images (including both frontal and lateral views) and 3,955 radiology reports. It is commonly divided into training, validation, and test sets in a 7:1:2 ratio. However, because some diseases have only a few positive samples, the original test split is not ideal for disease-specific evaluation. Following~\cite{jin2024promptmrg}, we therefore evaluate on the entire IU X-Ray dataset using the model trained on the MIMIC-CXR training set.

\textbf{Evaluation metrics:}
To evaluate model performance, we employ both Natural Language Generation (NLG) metrics and Clinical Efficacy (CE) metrics. \textbf{NLG metrics} include BLEU, METEOR, and ROUGE-L. BLEU and METEOR were originally proposed for machine translation evaluation, while ROUGE-L was designed to assess the quality of summaries. \textbf{CE metrics} evaluate the clinical validity of the generated reports. We apply CheXbert~\cite{smit2020chexbert} to label the generated reports and compute precision, recall, and F1 scores by comparing the predicted labels with those of the ground-truth reports.

\textbf{Implementation:}
Our method employs an ImageNet-pretrained ResNet-101 as the encoder and BERT-base as the decoder. We optimize with AdamW using a weight decay of 0.05. The initial learning rate is \(5 \times 10^{-5}\) and is adjusted with a cosine annealing schedule. In the total training loss, \(\lambda\) is adjusted dynamically during training and \(\gamma\) is set to 0.1. The model is trained for 10 epochs with a batch size of 16.
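
For reference, a common form of cosine annealing is sketched below. It assumes that the learning rate is updated once per epoch and decays to a minimum value \( \eta_{\min}=0 \); neither choice is stated above, and \( \eta_e \), \( \eta_0 \), and \( E \) are symbols introduced only for this illustration:
\begin{equation*}
\eta_e=\eta_{\min}+\tfrac{1}{2}\,(\eta_0-\eta_{\min})\Bigl(1+\cos\frac{\pi e}{E}\Bigr),\qquad e=0,1,\dots,E.
\end{equation*}
With \( \eta_0=5\times10^{-5} \) and \( E=10 \), this schedule would give \( \eta_5=2.5\times10^{-5} \) at the midpoint of training and \( \eta_{10}=0 \) at the end.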
\subsection{Results}
We compare our model with state-of-the-art (SOTA) methods, including R2Gen~\cite{chen2020generating}, M2TR~\cite{nooralahzadeh2021progressive}, MKSG~\cite{yang2022knowledge}, CliBert~\cite{yan2022clinical}, M2KT~\cite{yang2023radiology}, METrans~\cite{wang2023metransformer}, KiUT~\cite{huang2023kiut}, DCL~\cite{li2023dynamic}, RGRG~\cite{tanida2023interactive}, HSA~\cite{zhu2024multivariate}, PromptMRG~\cite{jin2024promptmrg}, and MAN~\cite{shen2024automatic}. Table~\ref{tab:sota} presents the results on the MIMIC-CXR and IU X-Ray datasets. Our method achieves the best performance on all three Clinical Efficacy metrics on both datasets. Compared with PromptMRG, the strongest baseline, it improves precision, recall, and F1 on MIMIC-CXR by 1.1, 0.4, and 0.7 points, respectively, and ROUGE-L on IU X-Ray by 2.8 points.
This suggests that the model is better at capturing positive samples while balancing precision and recall, and that learning from negative samples further helps it identify positive samples. However, there is still room for improvement in NLG metrics on MIMIC-CXR, where BLEU-4, METEOR, and ROUGE-L remain below the best reported values; generating long and complex sentences in particular requires further optimization in future work.
\begin{table}[!t]
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{llcccccccc}
\toprule
\multirow{2}{*}{Dataset} & \multirow{2}{*}{Model} & \multirow{2}{*}{Year} & \multicolumn{3}{c}{CE Metrics} & \multicolumn{4}{c}{NLG Metrics} \\
\cmidrule(lr){4-6} \cmidrule(lr){7-10}
 & & & Precision & Recall & F1 & BLEU-1 & BLEU-4 & METEOR & ROUGE-L \\
\midrule
\multirow{13}{*}{MIMIC} & R2Gen & 2020 & 0.333 & 0.273 & 0.276 & 0.353 & 0.103 & 0.142 & 0.277 \\
 & M2TR & 2021 & 0.240 & 0.428 & 0.308 & 0.378 & 0.107 & 0.145 & 0.272 \\
 & MKSG & 2022 & 0.458 & 0.348 & 0.371 & 0.363 & 0.115 & - & 0.284 \\
 & CliBert & 2022 & 0.397 & 0.435 & 0.415 & 0.383 & 0.106 & 0.144 & 0.275 \\
 & M2KT & 2023 & 0.420 & 0.339 & 0.352 & 0.386 & 0.111 & - & 0.274 \\
 & METrans. & 2023 & 0.364 & 0.309 & 0.311 & 0.386 & 0.124 & 0.152 & \textbf{0.291} \\
 & KiUT & 2023 & 0.371 & 0.318 & 0.321 & 0.393 & 0.113 & 0.160 & 0.285 \\
 & DCL & 2023 & 0.471 & 0.352 & 0.373 & - & 0.109 & 0.150 & 0.284 \\
 & RGRG & 2023 & 0.461 & 0.475 & 0.447 & 0.373 & \textbf{0.126} & \textbf{0.168} & 0.264 \\
 & MAN & 2024 & 0.411 & 0.398 & 0.389 & 0.396 & 0.115 & 0.151 & 0.274 \\
  & HSA & 2024 & 0.480 & 0.357 & 0.379 & 0.386 & 0.120 & 0.163 & 0.288 \\
 & PromptMRG & 2024 & 0.501 & 0.509 & 0.476 & 0.398 & 0.112 & 0.157 & 0.268 \\
\cmidrule(lr){2-10}
 & \textbf{DiffRGenNet (ours)} & \textbf{-} & \textbf{0.512} & \textbf{0.513} & \textbf{0.483} & \textbf{0.402} & 0.119 & 0.163 & 0.275 \\
\midrule
\multirow{7}{*}{IU X-Ray} & R2Gen\textsuperscript{†} & 2020 & 0.141 & 0.136 & 0.136 & 0.325 & 0.059 & 0.131 & 0.253 \\
 & CVT2Dis.\textsuperscript{†} & 2022 & 0.174 & 0.172 & 0.168 & 0.383 & 0.082 & 0.147 & 0.277 \\
 & M2KT\textsuperscript{†} & 2023 & 0.153 & 0.145 & 0.145 & 0.371 & 0.078 & 0.153 & 0.261 \\
 & DCL\textsuperscript{†} & 2023 & 0.168 & 0.167 & 0.162 & 0.354 & 0.074 & 0.152 & 0.267 \\
 & RGRG\textsuperscript{†} & 2023 & 0.183 & 0.187 & 0.180 & 0.266 & 0.063 & 0.146 & 0.180 \\
  & PromptMRG\textsuperscript{†} & 2024 & 0.213 & 0.229 & 0.211 & 0.401 & 0.098 & 0.160 & 0.281 \\
\cmidrule(lr){2-10}
 & \textbf{DiffRGenNet (ours)} & \textbf{-} & \textbf{0.216} & \textbf{0.230} & \textbf{0.213} & \textbf{0.417} & \textbf{0.104} & \textbf{0.167} & \textbf{0.309} \\
\bottomrule
\end{tabular}
}
{\vspace{-5pt}\caption{\small{Comparison with MRG methods on MIMIC-CXR and IU X-Ray datasets. %∗ indicates the used image size is larger than 224. 
`†' indicates the performance evaluated by us. The best results are in bold.}}\label{tab:sota}\vspace{-20pt}}
% \label{tab:sota}
\end{table}




\subsection{Model Analysis}

\textbf{Ablation study:}
To validate the effectiveness of each module, we conduct ablation experiments on the MIMIC-CXR dataset. As shown in Table~\ref{tab:ablation}, removing the Diff prompt leads to a slight degradation (F1 drops from 0.483 to 0.481), whereas removing the contrastive learning module causes the largest drop (F1 falls to 0.382 and BLEU-4 to 0.107, although ROUGE-L stays at 0.277), highlighting the critical role of negative-sample learning in report generation. Removing the FAM module, i.e., replacing its DAR layers with a standard Transformer, also lowers performance (F1 of 0.477 and BLEU-4 of 0.112). Overall, each module contributes positively to the MRG task, which supports their design.
\begin{table}[!t]
\centering
\begin{tabular}{lccccc}
\toprule
\textbf{Model} & \textbf{B-1} & \textbf{B-4} & \textbf{M} & \textbf{R-L} & \textbf{F1} \\
\midrule
DiffRGenNet (ours)                  & 0.402 & 0.119 & 0.163 & 0.275 & 0.483 \\
w/o Diff Prompt       & 0.401 & 0.115 & 0.160 & 0.269 & 0.481 \\
w/o DAR w Transformer & 0.400 & 0.112 & 0.159 & 0.268 & 0.477 \\
w/o SCL               & 0.386 & 0.107 & 0.148 & 0.277 & 0.382 \\
\bottomrule
\end{tabular}
{\vspace{-5pt}\caption{\small{Ablation study of each module on MIMIC dataset.}}\label{tab:ablation}\vspace{-10pt}}
% \label{tab:ablation}
\end{table}


\textbf{Qualitative results:}
We present a qualitative example to compare DiffRGenNet with the baseline. As shown in Figure~\ref{fig:qualitative}, red text highlights key descriptions in the report, purple text indicates errors, and shaded text marks the differential changes of interest. Our method generates a report consistent with the ground truth: it correctly assesses both normal and abnormal conditions and pays particular attention to changes in abnormalities. In contrast, the baseline method~\cite{jin2024promptmrg} incorrectly generates ``the small pleural effusions,'' which are not present in the ground truth, and gives an imprecise ``mild-to-moderate'' description of pulmonary edema. The Appendix presents more experimental results.

\begin{figure}[!t]
\floatconts
  {fig:qualitative}
  {\vspace{-25pt}\caption{\small{Qualitative examples of the baseline~\cite{jin2024promptmrg} and the proposed method. Red indicates content consistent with the ground truth, while purple indicates incorrect content.}}\vspace{-20pt}}
  {\includegraphics[width=\linewidth]{midl/rebutalyuanexample.png}}
\end{figure}





\section{Conclusion}
In this paper, we propose DiffRGenNet, a method for medical report generation that captures fine-grained features from both global and local dynamics, with a particular focus on regions of disease progression. We introduce the FlexiRoute Aggregation Module (FAM), which improves fine-grained feature extraction, and a Diff module that strengthens attention to areas of disease change. Finally, by employing contrastive learning with positive and negative samples, we further improve the model's generalization, robustness, and ability to identify rare diseases.


\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{Supported by Natural Science Foundation of China under Grant 62271465, Suzhou Basic Research Program under Grant SYG202338, and Open Fund Project of Guangdong Academy of Medical Sciences, China (No. YKY-KF202206).}


\bibliography{sample}


\appendix

\section{Variations in Dynamic Routing Settings}
To identify the most suitable routing module, we conduct two sets of experiments on the MIMIC-CXR dataset: varying the number of routing layers in the FAM module, and comparing different routing architectures.

As shown in Figure~\ref{fig:number}, we compare 1 to 5 routing layers. The F1 score improves from 0.46 with one layer to 0.48 with two layers, and then gradually decreases to 0.43 as more layers are added. In addition, as shown in Table~\ref{tab:cdifferent routing}, we evaluate different routing architectures: a standard Transformer-based architecture, a TRAR-based architecture, and the simplest alternative of concatenating the two features without routing. Our routing architecture (FAM) achieves the best result on every metric, for example an F1 of 0.483 compared with 0.478 for TRAR, 0.476 for the Transformer, and 0.267 for concatenation.
\begin{figure}[H]
  \begin{minipage}[c]{0.48\textwidth} % 使用 [c] 居中对齐
    \centering
    \includegraphics[width=\linewidth]{midl/layer.png}
    % {\vspace{-25pt}\caption{\small{The effect of varying the number of routing layers in the FAM module on the network.}}\vspace{-20pt}}
    {\captionof{figure}{\small{The effect of varying the number of routing layers in the FAM module on the network.}}}
    \label{fig:number}
  \end{minipage}
  \hfill
  \begin{minipage}[c]{0.48\textwidth} % 使用 [c] 居中对齐
    \centering
    \resizebox{\linewidth}{!}{
    \begin{tabular}{lccccc}
      \toprule
      \textbf{Module} & \textbf{B-1} & \textbf{B-4} & \textbf{M} & \textbf{R-L} & \textbf{F1} \\
      \midrule
      Transformer & 0.399 & 0.113 & 0.160 & 0.269 & 0.476 \\
      TRAR & 0.400 & 0.116 & 0.161 & 0.272 & 0.478 \\
      Concat & 0.313 & 0.101 & 0.140 & 0.265 & 0.267 \\
      Ours(FAM) & 0.402 & 0.119 & 0.163 & 0.275 & 0.483 \\
      \bottomrule
    \end{tabular}
    }
    % {\vspace{-25pt}\caption{\small{The effect of different routing architectures on the network.}}\vspace{-20pt}}
    % \caption{\small{The effect of different routing architectures on the network.}}
    {\captionof{table}{\small{The effect of different routing architectures on the network.}}}
    \label{tab:cdifferent routing}
  \end{minipage}
\end{figure}


\section{Example of the model’s ability to capture meaningful differences}
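We argue that the model's difference awareness is reflected in how it captures differential changes. In Figure~\ref{fig:qualitative_diff}, words shaded in gray describe disease severity, and words in red denote the diseases that are present. Our method captures the severity terms that appear in the ground truth, such as ``moderate'' and ``mild'', which it renders as ``mild to moderate'', ``subtle'', and ``early''. In addition, the term ``borderline'' reflects the ability of our method to capture finer-grained changes.


\begin{figure}[H]
\floatconts
  {fig:qualitative_diff}
  {\vspace{-25pt}\caption{\small{Qualitative examples of the baseline~\cite{jin2024promptmrg} and the proposed method. Red indicates content consistent with the ground truth, while purple indicates incorrect content.}}\vspace{-20pt}}
  {\includegraphics[width=\linewidth]{midl/rebutalview.jpg}}
\end{figure}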



\section{Ablation Study of the Feature Difference Module}
To verify the effectiveness of the three distinct metrics in the Feature Difference Module, we conduct three ablation experiments on the MIMIC-CXR dataset: (1) removing the L2 distance and the dot product similarity; (2) removing the dot product similarity and the cosine similarity; and (3) removing the L2 distance and the cosine similarity.

The results in Table~\ref{tab:ablation_diff} support the use of all three metrics: combining them yields the best BLEU-4, METEOR, and ROUGE-L. Among the single-metric variants, keeping only the L2 distance (setting 2) is the most effective at capturing differences, while the other two metrics bring smaller gains.
\begin{table}[H]
\centering
\begin{tabular}{lcccc}
\toprule
\textbf{DiffRGenNet} &  \textbf{B-4} & \textbf{M} & \textbf{R-L} & \textbf{F1} \\
\midrule
w/ Feature Difference                  & 0.119 & 0.163 & 0.275 & 0.483 \\
w/o L2 distance + dot product similarity       & 0.115 & 0.160 & 0.270 & 0.481 \\
w/o dot product similarity + cosine similarity  &  \textbf{0.118} &  \textbf{0.162} & 0.271 &  \textbf{0.483} \\
w/o L2 distance + cosine similarity               & 0.116 & 0.160 &  \textbf{0.273} & 0.480 \\
\bottomrule
\end{tabular}
{\vspace{-5pt}\caption{\small{Ablation study of the Feature Difference Module on the MIMIC-CXR dataset. Bold marks the best result among the ablated variants.}}\label{tab:ablation_diff}\vspace{-10pt}}
\end{table}


\end{document}