\section{Our Approach To Medical Visual Question Answering}

Our proposed model for the Difference Medical VQA task utilizes a \replaced{VED}{Vision Encoder-Decoder (VED)} architecture as shown in Figure \ref{fig:model_arquitecture}. It integrates the Swin Transformer \cite{swin} as a vision model and a transformer decoder as the language model.

% Architecture figure
\begin{figure}[h!]
	\centering
	\includegraphics[width=\linewidth]{figures/MODELARQUITECTURE.pdf}
	\caption{Our proposed architecture for \added{Difference} Medical Visual Question Answering.}
	\label{fig:model_arquitecture}
\end{figure}

\subsection{Vision Component}

\added{We evaluated multiple vision encoders, including EfficientNet \cite{tan2020efficientnetrethinkingmodelscaling}, ViT \cite{vit}, and TinyViT \cite{wu2022tinyvitfastpretrainingdistillation}, as potential candidates for our vision module. Preliminary experiments with EfficientNet resulted in significantly degraded performance. Similarly, TinyViT and standard ViT architectures required substantially larger models to achieve comparable accuracy. Given these limitations, we selected the Swin Transformer due to its strong performance across various vision tasks, including image classification, object detection, and segmentation, while maintaining an excellent trade-off between accuracy and computational efficiency.}

Swin’s hierarchical architecture and its efficient shifted window-based self-attention mechanism make it particularly well-suited for X-ray image analysis. Our implementation employs the Swin Base variant (SwinB-384), pretrained on ImageNet-21K \cite{imagenet21k} and fine-tuned on the MIMIC-CXR dataset using CheXpert \cite{chexpert} labels. Input images are processed as three-channel 384 $\times$ 384 pixel images. The Swin model consists of four stages with depths of 2, 2, 18, and 2, a patch size of 4 $\times$ 4, and a window size of 12 $\times$ 12.
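
As a rough check of the resulting feature shapes, the following worked arithmetic assumes the standard Swin design, in which patch merging halves the spatial resolution and doubles the channel width between consecutive stages, together with the Swin-B base width $C = 128$. The symbols $H_s$ (side length of the feature map at stage $s$) and $C_s$ (channel width at stage $s$) are introduced here for illustration only.
\[
H_1 = \frac{384}{4} = 96, \quad H_2 = 48, \quad H_3 = 24, \quad H_4 = 12,
\]
\[
(C_1, C_2, C_3, C_4) = (128, 256, 512, 1024).
\]
Under these assumptions, the final stage yields $12 \times 12 = 144$ tokens of width 1024 per image, so a single $12 \times 12$ window spans the entire final-stage feature map, and the token width coincides with the hidden size of 1024 used by the decoder.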

\subsection{Differentiation Component}

We introduce an Image Differentiation Embedding (IDE) mechanism to help the model distinguish between the two input images during the fusion process. We learn two tensors of dimension $d$, where $d$ is the hidden size of the decoder. These tensors are trained jointly with the rest of the model through backpropagation and are kept fixed during inference. \deleted{The IDE is represented as a learnable embedding matrix, $E \in \mathbb{R}^{2 \times d}$, where $d$ is the hidden dimension of the decoder. Each row of $E$ corresponds to a unique identifier for one of the two images in the input pair.}

For each image, one of these tensors is added across all its sequence tokens. Specifically, for the feature tensor of the first image $F \in \mathbb{R}^{t \times d}$, we add the first learned tensor $IDE_1 \in \mathbb{R}^{1 \times d}$ to each of the $t$ tokens of the first image. The same procedure applies to the second image using the second learned tensor $IDE_2$. In Appendix \ref{appendix:ide} we show how we implement IDE in PyTorch. \deleted{For each image, the embedding associated with its row in $E$ is broadcasted and added across all its feature tokens. Specifically, given an image feature tensor $F \in \mathbb{R}^{b \times t \times d}$, where $b$ is the batch size, $t$ is the sequence of tokens, and $d$ is the feature dimension, the corresponding embedding $e_i \in \mathbb{R}^{d}$ (from the $i$-th row of $E$) is added to each of the $t$ tokens.} This mechanism ensures that each image retains a distinct differentiating signal throughout the processing pipeline, as reflected in the performance improvements shown in Table \ref{tab:stage_comparison}.
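
In equation form, and writing $F^{(k)}$ for the feature tensor of image $k \in \{1, 2\}$ (a notation introduced here for illustration, with $F^{(1)}$ corresponding to $F$ above), the operation described can be sketched as
\[
\tilde{F}^{(k)}_{j} = F^{(k)}_{j} + IDE_k, \qquad j = 1, \dots, t, \quad k \in \{1, 2\},
\]
where $F^{(k)}_{j} \in \mathbb{R}^{1 \times d}$ denotes the $j$-th token of image $k$. Because the same offset is shared by all tokens of an image, IDE adds only $2d$ learnable parameters.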

\subsection{Language Component}

\added{Our choice of a 3-layer transformer decoder is motivated by several considerations. First, the dataset features a very limited vocabulary of only 101 words, and the questions and answers exhibit a similar structure, reducing the need for a large, heavily pretrained model with an extensive vocabulary. Second, we conducted experiments with decoder configurations of 2, 3, and 4 layers. The 3-layer setup emerged as the best configuration: 2 layers resulted in inferior performance, and although 4 layers matched the performance of the 3-layer model, it imposed higher computational costs without additional benefits. Third, our experiments with a pretrained BERT-base decoder (augmented with new cross-attention layers) yielded significantly worse performance, indicating that a pretrained decoder might not align well with our task's specific characteristics. Therefore, a lightweight, 3-layer transformer decoder strikes the best balance between computational efficiency and effective fusion of visual and textual information for autoregressive decoding in the medical question-answering context.}

\added{The decoder has a hidden size of 1024, an intermediate size of 4096, and employs GELU activation. It comprises both self-attention layers and cross-attention layers. In this design, the cross-attention mechanism leverages the final-stage output of the Swin Transformer as keys and values, while the decoder’s self-attention layers process the textual context. This arrangement enables effective fusion of visual and textual information, facilitating accurate autoregressive decoding of medical question answers. Notably, this lightweight decoder, containing approximately 51M parameters, is significantly smaller than the PLURAL model \mbox{\cite{plural}}, which has 184M parameters.}
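
The cross-attention step can be written with the standard scaled dot-product attention formulation. The symbols $H$ (decoder hidden states), $Z$ (final-stage Swin output), $W_Q$, $W_K$, $W_V$ (projection matrices), and $d_k$ (per-head key dimension) are introduced here for illustration, and the number of attention heads is not specified in this section:
\[
\mathrm{CrossAttn}(H, Z) = \mathrm{softmax}\!\left( \frac{(H W_Q)(Z W_K)^{\top}}{\sqrt{d_k}} \right) Z W_V .
\]
As a rough consistency check of the parameter count, assume that each layer contains one self-attention block and one cross-attention block (each with four $d \times d$ projections) and a feed-forward block with two $d \times 4d$ matrices, with $d = 1024$ and with biases and normalization layers ignored:
\[
3 \times \left( 4d^2 + 4d^2 + 2 \cdot d \cdot 4d \right) = 48\, d^2 = 48 \times 1024^2 \approx 50.3\,\mathrm{M}.
\]
Under these assumptions, the small remainder up to roughly 51M would plausibly come from the embeddings, biases, normalization layers, and the output head.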

\deleted{For the decoder, we chose the BERT architecture due to its proven success in various NLP tasks. However, to balance performance with computational efficiency, we reduced the standard 12-layer configuration to a 3-layer architecture with a hidden size of 1024. This transformer-based text decoder comprises only 51M parameters, making it much more lightweight compared to the PLURAL \mbox{\cite{plural}} model, the current state-of-the-art, which has 184M parameters.}

\deleted{The connection between the Swin Transformer and BERT is established through a cross-attention mechanism. Here, the Swin model's final stage output serves as key and value inputs, while the BERT decoder's output is used as the query input. This mechanism facilitates effective fusion of visual and textual information, enabling accurate autoregressive decoding of medical question answers.}

\subsection{Training Strategy}

Our training strategy is a three-stage process aimed at optimizing the model’s vision and language components sequentially before integrating them for the VQA task.

In the first stage, we fine-tune the Swin Transformer on the MIMIC-CXR dataset with CheXpert labels. Since the Medical-Diff-VQA dataset is derived from the MIMIC-CXR dataset, we ensure consistency by using the same dataset splits. The AdamW \cite{adamw} optimizer is used with a learning rate of $1\times10^{-4}$, a batch size of 24, and a weight decay of 0.05. Training is conducted for 30 epochs using a Cosine Annealing learning rate scheduler. The best-performing model is selected based on the validation loss.
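
For reference, cosine annealing in its common form decays the learning rate as sketched below. The minimum learning rate $\eta_{\min}$ and the granularity of the schedule (per epoch or per iteration) are not stated in this section; the sketch assumes a per-epoch schedule over $E = 30$ epochs with $\eta_{\max} = 1 \times 10^{-4}$, and the symbols $\eta_e$, $\eta_{\min}$, $\eta_{\max}$, and $E$ are introduced here for illustration.
\[
\eta_e = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{\pi e}{E}\right)\right), \qquad e = 0, 1, \dots, E.
\]
With $\eta_{\min} = 0$ (an assumption), the learning rate at the midpoint $e = 15$ would be $5 \times 10^{-5}$.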

In the second stage, the fine-tuned Swin model is integrated with the transformer decoder to construct the Vision Encoder-Decoder (VED) architecture. During this phase, the parameters of the Swin model remain frozen, allowing the training to focus exclusively on the decoder. The decoder is trained for 20 epochs using a batch size of 64, employing the Adam optimizer \cite{adam} with a learning rate of $3\times10^{-4}$.

In the third stage, we unfreeze the Swin model and fine-tune the entire VED architecture for 20 more epochs with the learning rate set to $3\times10^{-6}$. Since we are training the entire model, we reduce memory consumption by using a smaller batch size of 8 and simulating an effective batch size of 64 through gradient accumulation.
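
The gradient accumulation used in this stage amounts to the following arithmetic, where $K$ (number of accumulation steps), $B_{\mathrm{micro}}$ (micro-batch size), $g_k$, and $\bar{g}$ are symbols introduced here for illustration:
\[
K = \frac{64}{B_{\mathrm{micro}}} = \frac{64}{8} = 8, \qquad \bar{g} = \frac{1}{K} \sum_{k=1}^{K} g_k ,
\]
where $g_k$ is the mean gradient over the $k$-th micro-batch of 8 samples. Assuming the batch loss is a mean over samples, one optimizer step with $\bar{g}$ matches a step on a single batch of 64, while only 8 samples need to be held in memory at a time.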

The model is trained with the negative log-likelihood loss, which penalizes low predicted probability for the correct answer tokens given the question and the images. During training, the learning rate is decreased linearly, and Hard Negative Mining \cite{hnm} is employed to select the most challenging samples for further training. To improve the quality of the generated text, beam search with a beam size of 2 is used during inference, enabling the model to explore multiple candidate outputs and select the most likely answer. For image inputs, a series of augmentation techniques, including shift, scale, rotation, and brightness/contrast adjustments, is applied to simulate various real-world conditions and enhance the model's robustness.
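
For the $i$-th sequence, the negative log-likelihood can be written as
\[
\mathcal{L}^{(i)} = -\frac{1}{T} \sum_{t=1}^{T} \log P\left(y_t^{(i)} \mid y_{<t}^{(i)}, x^{(i)}\right),
\]
where $y_t^{(i)}$ is the token at decoding step $t$, $y_{<t}^{(i)}$ denotes the preceding tokens, $x^{(i)}$ is the input context (the question and the two images), and $T$ is the sequence length over which the loss is averaged. Here $t$ indexes decoding steps rather than image tokens.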

\subsection{Medical-Diff-VQA Dataset}

\added{The Medical-Diff-VQA \cite{mimicdiffvqa} dataset is a comprehensive resource specifically designed for the Difference VQA task in the medical domain. It focuses on answering questions about differences between pairs of chest X-ray images, reflecting a radiologist's typical workflow of comparative analysis for diagnostic evaluation.}

\added{Derived from the MIMIC-CXR \cite{mimiccxr} database, the dataset comprises over 164,000 pairs of main and reference chest X-ray images, resulting in a total of 700,703 question-answer pairs. These questions are categorized into seven types: abnormality, location, type, level, view, presence, and difference. The ``difference'' category, which focuses on identifying changes between two images, includes 164,324 question-answer pairs, making it the most relevant subset for this study.}

\added{For the purposes of this work, only the ``difference'' question type is considered, as it aligns with the current state-of-the-art focus in the field and offers the most direct application for advancing medical diagnostic tools. By leveraging this targeted subset, the study aims to enhance model performance in generating precise and clinically relevant answers.}

\subsection{Framework Overview}

\added{Our framework integrates a vision encoder with a transformer decoder in a three-stage training process:}
    
\begin{itemize} 
    \item \added{\textbf{Input:} Two preprocessed 384$\times$384 chest X-ray images and a tokenized question.}
    \item \added{\textbf{Vision:} A Swin Transformer extracts hierarchical features, enhanced by an Image Differentiation Embedding (IDE) to distinguish the images.}
    \item \added{\textbf{Language:} A lightweight 3-layer transformer decoder fuses visual features with textual context via cross-attention to generate answers.}
    \item \added{\textbf{Training:} We fine-tune the vision module, train the decoder with frozen vision parameters, and finally fine-tune the entire model end-to-end.}
\end{itemize}

%The loss function for the $i$-th sequence is defined as:

%\begin{equation}
%\mathcal{L}^{(i)} = -\frac{1}{T} \sum_{t=1}^{T} \log P(y_t^{(i)} | y_{<t}^{(i)}, x^{(i)})
%\end{equation}

%For each input sequence, the model predicts the probability of the next token given the previous tokens and the input context. Specifically, the term $P(y_t^{(i)} | y_{<t}^{(i)}, x^{(i)})$ represents the conditional probability of the token $y_t^{(i)}$ at time step $t$ in the $i$-th sequence given the previous tokens $y_{<t}^{(i)}$ and the input context $x^{(i)}$. The loss is then averaged over the sequence length $T$.%
