\deleted{The Medical-Diff-VQA \cite{mimicdiffvqa} dataset is a comprehensive resource specifically designed for the Difference VQA task in the medical domain. It focuses on answering questions about differences between pairs of chest X-ray images, reflecting a radiologist's typical workflow of comparative analysis for diagnostic evaluation.}

\deleted{Derived from the MIMIC-CXR \cite{mimiccxr} database, the dataset comprises over 164,000 pairs of main and reference chest X-ray images, resulting in a total of 700,703 question-answer pairs. These questions are categorized into seven types: abnormality, location, type, level, view, presence, and difference. The ``difference'' category, which focuses on identifying changes between two images, includes 164,324 question-answer pairs, making it the most relevant subset for this study.}

\deleted{This dataset aligns with the clinical treatment process—assessment, diagnosis, intervention, and evaluation—by emphasizing differential analysis. Such a focus ensures its applicability to real-world scenarios where understanding image variations over time is crucial for patient management.}

\deleted{For the purposes of this work, only the ``difference'' question type is considered, as it aligns with the current state-of-the-art focus in the field and offers the most direct application for advancing medical diagnostic tools. By leveraging this targeted subset, the study aims to enhance model performance in generating precise and clinically relevant answers.}

\section{Related Work}

\deleted{The Expert Knowledge-Aware Graph Representation (EKAID) \cite{mimicdiffvqa} model, introduced alongside the Medical-Diff-VQA dataset, is designed for difference VQA tasks in medical imaging. This model represents anatomical structures as graph nodes and constructs multi-relationship graphs to capture spatial, semantic, and implicit relationships. Features are extracted from anatomical and disease regions using Faster-RCNNs trained on medical imaging datasets and are further refined through a Relation-Aware Graph Attention Network (ReGAT); finally, an LSTM with attention modules is used as the decoder, or answer generator. By incorporating medical knowledge graphs, the model leverages domain-specific insights to enhance interpretability and provide accurate answers to questions about image differences. The model demonstrates its capability in handling subtle disease progressions and mitigating variations caused by pose or orientation differences, outperforming MCCFormers and IDCPCL in a baseline comparison.}

\deleted{An Expert Insight-Enhanced (EIE) framework has been proposed for follow-up chest X-ray summary generation. This model incorporates an expert-guided difference capture module, a two-layer transformer, and a cross-modality follow-up summary generator, a three-layer transformer.}

\deleted{A retrieval-augmented approach, termed RegioMix, utilizes mix-and-match strategies to generate pseudo-difference descriptions. This method employs region-specific retrieval augmentation and a Dual Alignment module to align retrieved descriptions with input image pairs and questions. As the encoder, it uses a backbone adapted from EKAID and, as the decoder, two LSTM modules.}

\deleted{PLURAL employs a Transformer-based encoder-decoder framework and a three-stage training process: (1) pretraining on general-purpose datasets (e.g., COCO, CC12M) to establish foundational vision-language understanding, (2) pretraining on longitudinal chest X-ray data (MIMIC-CXR, MIMIC-Diff-VQA) with additional input branches for temporal information and insights from radiology reports, and (3) fine-tuning with DiffVQA-specific data. The model processes pairs of past and current images using ResNet-101 encoders and integrates positional and temporal encodings. Outputs are passed to OFA, a Transformer encoder-decoder for answer generation.}

\added{The Expert Knowledge-Aware Graph Representation (EKAID) \cite{mimicdiffvqa} model, introduced alongside the Medical-Diff-VQA dataset, is designed for difference VQA tasks in medical imaging. It represents anatomical structures as graph nodes and constructs multi-relationship graphs to capture spatial, semantic, and implicit relationships. Features are extracted from anatomical and disease regions using Faster R-CNN \cite{fasterrcnn} detectors trained on medical imaging datasets and are then refined through a Relation-Aware Graph Attention Network (ReGAT) \cite{regat}; an LSTM \cite{lstm} with attention modules serves as the decoder for answer generation. By incorporating medical knowledge graphs, the model draws on domain-specific knowledge to improve interpretability and to answer questions about image differences. EKAID handles subtle disease progression and mitigates variations caused by pose or orientation differences, outperforming MCCFormers \cite{mccformers} and IDCPCL \cite{idcpcl} in a baseline comparison.}

\added{Building upon difference-aware medical imaging models, an Expert Insight-Enhanced (EIE) \cite{eieall} framework has been proposed for follow-up chest X-ray summary generation. This model combines an expert-guided difference capture module, implemented as a two-layer transformer, with a cross-modality follow-up summary generator that employs a three-layer transformer to improve result coherence. Similarly, RegioMix \cite{regiomix} introduces a retrieval-augmented approach that employs mix-and-match strategies to generate pseudo-difference descriptions. It uses region-specific retrieval augmentation and a dual alignment module to align retrieved descriptions with input image pairs and questions, with an encoder backbone adapted from EKAID and a decoder comprising two LSTM modules.}

\added{Further advancing temporal medical VQA, PLURAL \cite{plural} employs a Transformer-based encoder-decoder framework trained in three stages: (1) pretraining on general-purpose datasets (e.g., COCO \cite{coco}, CC12M \cite{cc12m}) to establish foundational vision-language understanding, (2) pretraining on longitudinal chest X-ray data (MIMIC-CXR, MIMIC-Diff-VQA) with additional input branches for temporal information and insights from radiology reports, and (3) fine-tuning with DiffVQA-specific data. The model processes pairs of past and current images using ResNet-101 \cite{resnet} encoders while integrating positional and temporal encodings, with outputs passed to OFA \cite{ofa}, a Transformer encoder-decoder for answer generation.} \deleted{Collectively, these approaches illustrate the evolving landscape of difference-aware medical VQA, leveraging graph-based representations, retrieval-augmented techniques, and transformer-based architectures to enhance interpretability and accuracy.}

%ReAl \cite{real} introduces a residual-based approach for DiffVQA, explicitly focusing on temporal differences between images. It employs three encoders: image encoders for past and current images, a residual encoder for processing differences between image pairs, and a text encoder for questions. The Residual Feature Alignment (RFA) module ensures alignment between residual and image-based features, enhancing attention to disparities. Unlike classification-based VQA methods, ReAl uses a GPT-2-based \cite{gpt2} generative decoder to dynamically generate detailed answers. The model is trained with a joint loss combining generative and consistency losses, optimizing performance on the Medical-Diff-VQA dataset.%