\section{Experimentation and Results}

All experiments were conducted on a single NVIDIA RTX 4090 GPU with 24\,GB of memory. Our implementation uses the PyTorch deep learning framework (version 2.1.1) together with the Hugging Face Transformers library (version 4.27.4).

For evaluation, we followed the metrics and methodology of the original paper, \added{all of which are token-based}. Specifically, we used BLEU \cite{bleu}, METEOR \cite{meteor}, ROUGE-L \cite{rouge}, and CIDEr \cite{cider} to assess the effectiveness of our model. \replaced{In addition, we report BERTScore \cite{bertscore}, an embedding-based metric intended to track human judgment more closely than token-based metrics can.}{The use of this diverse set of metrics ensures a comprehensive evaluation of the model's performance, balancing both linguistic fidelity and domain-specific requirements. Such an approach is particularly important for the Medical VQA task, where generating accurate, clear, and clinically relevant answers is critical.} The improvement in performance at each stage is shown in Table~\ref{tab:stage_comparison}, and the comparison of our model with state-of-the-art methods on the Medical-Diff-VQA dataset is shown in Table~\ref{tab:results}.
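
For reference, the BLEU-$N$ scores reported in this section can be read against the standard definition from \cite{bleu}. The notation used here ($p_n$, $c$, $r$, and uniform weights $1/N$) is introduced only for illustration and assumes the usual configuration, which may differ in detail from the evaluation toolkit actually used:
\[
  \mathrm{BLEU}\mbox{-}N = \mathrm{BP} \cdot \exp\left( \frac{1}{N} \sum_{n=1}^{N} \log p_n \right),
  \qquad
  \mathrm{BP} = \min\left( 1,\; e^{\,1 - r/c} \right),
\]
where $p_n$ is the clipped $n$-gram precision of the generated answers against the references, $c$ is the total length of the generated answers, and $r$ is the total reference length. Every term depends only on $n$-gram overlap, so the score rewards shared wording rather than shared meaning.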

\begin{table}[htbp]
    \centering
    \caption{Performance of our model after each training stage. The best results are shown in \textbf{bold}.}
    \label{tab:stage_comparison}
    \resizebox{\textwidth}{!}{%
    \begin{tabular}{lcccccccc}
        \toprule
        Model                   & BLEU-1         & BLEU-2         & BLEU-3         & BLEU-4         & METEOR         & ROUGE-L        & CIDEr          & BERTScore      \\ \midrule
        Baseline                & 0.546          & 0.486          & 0.436          & 0.385          & 0.317          & 0.589          & 1.492          & 0.614          \\
        + IDE                   & 0.673          & 0.600          & 0.541          & 0.486          & 0.358          & 0.628          & 1.698          & 0.658          \\
        + Train Decoder FT-Swin & 0.703          & 0.632          & 0.572          & 0.516          & 0.385          & 0.659          & 1.849          & \textbf{0.706} \\
        + Unfreeze Swin         & \textbf{0.716} & \textbf{0.647} & \textbf{0.590} & \textbf{0.537} & \textbf{0.389} & \textbf{0.670} & \textbf{2.119} & 0.704          \\
        \bottomrule
    \end{tabular}%
    }
\end{table}

The results in Table~\ref{tab:stage_comparison} show a steady improvement across the stages of our training strategy. The baseline trains the entire VED architecture, with the decoder initialized from scratch and the Swin model initialized from its ImageNet-21k pretrained weights; it therefore reflects performance without domain-specific optimization. Incorporating IDE improves every metric. In the second stage, training the decoder while the fine-tuned Swin remains frozen yields additional gains. Finally, in the third stage, unfreezing the Swin model for joint optimization gives the best results on all token-based metrics, while BERTScore remains essentially unchanged (0.704, against 0.706 in the second stage).
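
As a rough check on the size of these gains, the stage-wise changes in BLEU-4 and CIDEr can be read directly off Table~\ref{tab:stage_comparison}. The figures below are simple differences and ratios of the tabulated values and carry no statistical claim:
\[
  \Delta\,\mathrm{BLEU}\mbox{-}4 = (0.486 - 0.385) + (0.516 - 0.486) + (0.537 - 0.516) = 0.101 + 0.030 + 0.021 = 0.152,
\]
\[
  \Delta\,\mathrm{CIDEr} = (1.698 - 1.492) + (1.849 - 1.698) + (2.119 - 1.849) = 0.206 + 0.151 + 0.270 = 0.627.
\]
Relative to the baseline, these correspond to improvements of about $0.152 / 0.385 \approx 39.5\%$ in BLEU-4 and $0.627 / 1.492 \approx 42.0\%$ in CIDEr.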

\begin{table}[htbp]
    \centering
    \caption{Comparison of our model with state-of-the-art methods on the Medical-Diff-VQA dataset. The best results are shown in \textbf{bold}.}
    \label{tab:results}
    \resizebox{\textwidth}{!}{%
    \begin{tabular}{lccccccc}
        \toprule
        Methods                      & BLEU-1         & BLEU-2         & BLEU-3         & BLEU-4         & METEOR         & ROUGE-L         & CIDEr          \\ \midrule
        MCCFormers~\cite{mccformers} & 0.214          & 0.190          & 0.170          & 0.153          & 0.319          & 0.340           & 0              \\
        IDCPCL~\cite{idcpcl}         & 0.614          & 0.541          & 0.474          & 0.414          & 0.303          & 0.582           & 0.703          \\
        EKAID~\cite{mimicdiffvqa}    & 0.628          & 0.553          & 0.491          & 0.434          & 0.339          & 0.557           & 1.027          \\
        % CFGCL~\cite{cfgcl}           & 0.630          & 0.543          & 0.479          & 0.422          & 0.403          & 0.662           & 2.022          \\
        EIE-all~\cite{eieall}        & 0.646          & 0.583          & 0.528          & 0.477          & \textbf{0.401} & 0.653           & 1.698          \\
        RegioMix~\cite{regiomix}     & 0.705          & 0.633          & 0.572          & 0.517          & 0.381          & 0.651           & 1.804          \\
        PLURAL~\cite{plural}         & 0.704          & 0.633          & 0.575          & 0.520          & 0.381          & 0.653           & 1.832          \\ \midrule
        % ReAl~\cite{real}             & 0.710          & 0.636          & 0.580          & 0.530          & 0.395          & \textbf{0.736}  & \textbf{2.409} \\ 
        Ours                         & \textbf{0.716} & \textbf{0.647} & \textbf{0.590} & \textbf{0.537} & 0.389          & \textbf{0.670}  & \textbf{2.119} \\ \bottomrule
    \end{tabular}%
    }
\end{table}

Table~\ref{tab:results} compares our model with state-of-the-art methods on the Medical-Diff-VQA dataset. Our model achieves the highest scores on all BLEU metrics, with 0.716 in BLEU-1 and 0.537 in BLEU-4, indicating close $n$-gram agreement with the reference answers. Its METEOR score (0.389) is slightly lower than that of EIE-all (0.401), but it achieves the best ROUGE-L (0.670), ahead of the 0.653 obtained by EIE-all and PLURAL. Notably, its CIDEr score of 2.119 exceeds that of the next-best method, PLURAL (1.832), by 0.287. These results underscore the effectiveness of our approach on the Medical VQA task as measured by these metrics.

\begin{figure}[htbp]
	\centering
	\includegraphics[width=\linewidth]{figures/MUESTRAS2.pdf}
	\caption{Examples of difference questions and their corresponding answers generated by our model and the ground truth. Correct predictions are highlighted in \textcolor{mygreen}{green}, while incorrect predictions are highlighted in \textcolor{myred}{red}.}
	\label{fig:examples}
\end{figure}

Although these results outperform the state of the art, further research is needed to address some remaining limitations. For example, Figure~\ref{fig:examples} shows two scenarios in which all metrics are close to their highest possible values. In the first scenario, the high values reflect close agreement between the model's prediction and the ground truth.

The issue arises in the second scenario. There the scores are 1.00, 0.968, 0.911, and 0.853 for BLEU-1 through BLEU-4, 0.589 for METEOR, and 0.818 for ROUGE-L. These values are at or near the maximum for BLEU and exceed those reported in Table~\ref{tab:results}, yet the model's predicted response contains diagnostic errors. On closer inspection, the model correctly identifies the anomalies present in the radiograph but misjudges how they have changed, either reporting certain anomalies as resolved when they have not subsided or reporting new ones that are not actually present. Taken at face value, the scores suggest a high-quality output, yet they expose a critical issue: high performance on these metrics does not necessarily translate to correctness in a clinical diagnostic context.
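
A simple counting argument illustrates why such scores can stay high despite a diagnostic error. The setting is hypothetical and is not the example in Figure~\ref{fig:examples}: assume a generated answer of $c$ tokens that matches its reference exactly except for one substituted token, for instance a word stating whether a finding is new or resolved. The substituted token lies in at most $n$ of the $c - n + 1$ $n$-grams of the answer, so the clipped $n$-gram precision $p_n$ used by BLEU satisfies
\[
  p_n \;\ge\; \frac{(c - n + 1) - n}{c - n + 1} \;=\; \frac{c - 2n + 1}{c - n + 1}, \qquad c \ge 2n - 1 .
\]
For $c = 20$, this gives $p_1 \ge 19/20 = 0.95$ and $p_4 \ge 13/17 \approx 0.76$, with no brevity penalty because the two lengths are equal. If the error instead only reorders reference tokens, for example by moving a finding from one list to another, every generated unigram still occurs in the reference and $p_1 = 1$ exactly.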

This observation raises the question of whether existing evaluation metrics are appropriate for tasks such as Medical Visual Question Answering. It points to the need for new metrics tailored to this domain that go beyond assessing the linguistic or structural quality of the generated answers. For tasks involving medical diagnostics, it is essential to evaluate not only the fluency and relevance of the generated response but also its accuracy in identifying and interpreting anomalies. Addressing this gap is vital to ensure that models deployed in clinical settings contribute to accurate and reliable decision-making.
