Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering
Keywords: Difference Visual Question Answering, Vision Encoder-Decoder Model, Transformers, Medical Imaging
TL;DR: A novel Vision Encoder-Decoder (VED) model for Difference Medical VQA compares pairs of chest X-rays and identifies significant changes, such as pneumonia, with state-of-the-art accuracy, enhancing clinical decision-making.
Abstract: Difference Medical Visual Question Answering (Diff-VQA), a specialized subfield of Medical VQA, tackles the critical task of identifying and describing differences between pairs of medical images. This study introduces a novel Vision Encoder-Decoder (VED) architecture tailored for this task, focusing on the comparison of chest X-ray images to detect and explain changes. The proposed model incorporates two key innovations: (1) a lightweight Transformer text decoder architecture capable of generating precise and contextually relevant answers to complex medical questions, and (2) an enhanced fusion mechanism that improves the model’s ability to distinguish between two input images, enabling more accurate comparison of radiological findings. Our approach excels in identifying significant changes, such as pneumonia and lung opacity, demonstrating its utility in automating preliminary radiological assessments. By leveraging large-scale, domain-specific datasets and employing advanced training strategies, our VED architecture achieves state-of-the-art performance on standard VQA metrics, setting a new benchmark in diagnostic accuracy. These advancements highlight the potential of Diff-VQA to enhance clinical workflows and support radiologists in making more precise, informed decisions.
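The abstract does not include implementation details; the following is a minimal PyTorch sketch of how such a two-image VED could be wired together, assuming a shared patch-token vision encoder, a concatenation-plus-difference fusion of the two images' tokens, and a lightweight Transformer text decoder that cross-attends to the fused tokens. The ToyPatchEncoder, the fusion layer, and all dimensions are illustrative placeholders, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class ToyPatchEncoder(nn.Module):
    """Stand-in for a pretrained vision encoder; maps an X-ray to patch tokens."""
    def __init__(self, d_model=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 1, H, W) grayscale X-ray
        tokens = self.proj(x)                      # (B, d_model, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, d_model)

class DiffVQASketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.encoder = ToyPatchEncoder(d_model)      # shared by both images
        self.fuse = nn.Linear(3 * d_model, d_model)  # main / reference / difference
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)  # lightweight text decoder
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_main, img_ref, token_ids):
        f_main = self.encoder(img_main)              # (B, N, d)
        f_ref = self.encoder(img_ref)                # (B, N, d)
        # Fusion that keeps both views and their element-wise difference,
        # so the decoder can attend to what changed between the two studies.
        fused = self.fuse(torch.cat([f_main, f_ref, f_main - f_ref], dim=-1))
        tgt = self.tok_emb(token_ids)                # question + answer-so-far tokens
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")),
                            diagonal=1)
        out = self.decoder(tgt, fused, tgt_mask=causal)  # cross-attend to fused image tokens
        return self.lm_head(out)                     # next-token logits

# Dummy forward pass: two 224x224 grayscale X-rays and a 12-token question.
model = DiffVQASketch()
logits = model(torch.randn(2, 1, 224, 224), torch.randn(2, 1, 224, 224),
               torch.randint(0, 1000, (2, 12)))
print(logits.shape)   # torch.Size([2, 12, 1000])
```

Concatenating the two token streams with their element-wise difference is only one plausible fusion strategy; cross-attention between the two images' tokens is another common choice, and the paper's enhanced fusion mechanism may differ from both.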
Primary Subject Area: Generative Models
Secondary Subject Area: Application: Radiology
Paper Type: Methodological Development
Registration Requirement: Yes
Visa & Travel: Yes
MIDL LaTeX Submission Checklist:
- Ensure no LaTeX errors during compilation.
- Created a single midl25_NNN.zip file with midl25_NNN.tex, midl25_NNN.bib, and all necessary figures and files.
- Includes \documentclass{midl}, \jmlryear{2025}, \jmlrworkshop, \jmlrvolume, \editors, and the correct \bibliography command.
- Did not override options of the hyperref package.
- Did not use the times package.
- Author and institution details are de-anonymized where needed. All author names, affiliations, and the paper title are correctly spelled and capitalized in the biography section.
- References must use the .bib file. Did not override the bibliographystyle defined in midl.cls. Did not use \begin{thebibliography} directly to insert references.
- Tables and figures do not overflow margins; avoided \scalebox; used \resizebox when needed.
- Included all necessary figures and removed *unused* files from the zip archive.
- Removed special formatting, visual annotations, and highlights used during rebuttal.
- All special characters in the paper and .bib file use LaTeX commands (e.g., \'e for é).
- Appendices and supplementary material are included in the same PDF after the references.
- Main paper does not exceed 9 pages; acknowledgements, references, and appendix start on page 10 or later.
LaTeX Code: zip
Copyright Form: pdf
Submission Number: 100