\section{Introduction}

Visual Question Answering (VQA) is a challenging task that bridges computer vision (CV) and natural language processing (NLP). It involves generating accurate and contextually relevant answers to specific questions posed about visual data. Unlike general vision tasks that focus solely on image analysis, VQA requires understanding both visual content and linguistic semantics, so that visual cues and textual inputs can be related meaningfully. This capability makes VQA valuable for real-world applications that demand precise, task-specific insights.

% In the medical domain, VQA takes on an even greater challenge due to the intricate nature of medical images and the critical importance of interpretative accuracy. Medical VQA must address diverse question types, including the identification of abnormalities, their locations, severities, and comparative analyses of temporal image changes. These tasks are compounded by the variability in medical imaging modalities, the need for domain-specific expertise, and the ambiguity or noise often present in clinical data. Additionally, aligning automated systems with clinical workflows remains a persistent issue, as practitioners require highly specific, actionable answers rather than generic predictions.

Medical VQA applies this task to medical images, where interpretative accuracy is critical. Difference Medical VQA adds a further layer of complexity by focusing on questions that require identifying and describing differences between pairs of medical images. A system must not only analyze each image individually but also discern and articulate clinically relevant changes between them. Addressing this problem requires an understanding of temporal and spatial variations, the integration of contextual information across the image pair, and the ability to generate precise, actionable descriptions of the differences. As such, Difference Medical VQA is a significant step toward automated, clinically meaningful insights.

To address these challenges, we present a model based on a Vision Encoder-Decoder (VED) architecture specifically designed for the Medical VQA task. The model is trained in three stages. In the first stage, the vision encoder is fine-tuned on a large-scale medical imaging dataset to capture the domain-specific visual features essential for accurate medical reasoning. In the second stage, the fine-tuned vision encoder is frozen, integrated with a text decoder, and the combined model is trained on a specialized Medical VQA dataset. In the third stage, the encoder is unfrozen and the entire model is fine-tuned to optimize the fusion of visual and textual information for generating precise answers. This staged training process enables the model to produce clinically precise and contextually relevant answers and supports robustness and adaptability to the real-world demands of medical QA.

\deleted{The model is evaluated on the Medical-Diff-VQA dataset, derived from the MIMIC-CXR database. This dataset is designed specifically for Medical VQA tasks, featuring diverse question types and focusing on comparative analyses across pairs of chest X-ray (CXR) images. With its clinical relevance and scale, the dataset provides a rigorous testing ground for assessing the performance of the proposed architecture.}

The key contributions of this study are (i) a lightweight Transformer text decoder architecture capable of generating precise and contextually relevant answers to complex medical questions, and (ii) a mechanism that enhances the model's ability to distinguish between the two input images during the fusion process.

To foster transparency and collaboration, we open-source our methodology on GitHub\footnote{\url{https://github.com/ljmtendero/A-VED-Model-For-Difference-Medical-VQA}}. We hope this will help the medical imaging community advance diagnostic precision and radiological interpretation, and ultimately improve patient care.
