Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering
Keywords: Difference Visual Question Answering, Vision Encoder-Decoder Model, Transformers, Medical Imaging
TL;DR: A novel Vision Encoder-Decoder (VED) model for Difference Medical VQA compares pairs of chest X-rays and identifies significant changes, such as pneumonia, with state-of-the-art accuracy, enhancing clinical decision-making.
Abstract: Difference Medical Visual Question Answering (Diff-VQA), a specialized subfield of Medical VQA, tackles the critical task of identifying and describing differences between pairs of medical images. This study introduces a novel Vision Encoder-Decoder (VED) architecture tailored for this task, focusing on the comparison of chest X-ray images to detect and explain changes. The proposed model incorporates two key innovations: (1) a lightweight Transformer text decoder architecture capable of generating precise and contextually relevant answers to complex medical questions, and (2) an enhanced fusion mechanism that improves the model’s ability to distinguish between two input images, enabling more accurate comparison of radiological findings. Our approach excels in identifying significant changes, such as pneumonia and lung opacity, demonstrating its utility in automating preliminary radiological assessments. By leveraging large-scale, domain-specific datasets and employing advanced training strategies, our VED architecture achieves state-of-the-art performance on standard VQA metrics, setting a new benchmark in diagnostic accuracy. These advancements highlight the potential of Diff-VQA to enhance clinical workflows and support radiologists in making more precise, informed decisions.
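The abstract does not include implementation details; the following is a minimal PyTorch sketch of how such a two-image VED could be wired together, assuming a shared patch-token vision encoder, a concatenation-plus-difference fusion of the two images' tokens, and a lightweight Transformer text decoder that cross-attends to the fused tokens. The ToyPatchEncoder, the fusion layer, and all dimensions are illustrative placeholders, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class ToyPatchEncoder(nn.Module):
    """Stand-in for a pretrained vision encoder; maps an X-ray to patch tokens."""
    def __init__(self, d_model=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 1, H, W) grayscale X-ray
        tokens = self.proj(x)                      # (B, d_model, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, d_model)

class DiffVQASketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.encoder = ToyPatchEncoder(d_model)      # shared by both images
        self.fuse = nn.Linear(3 * d_model, d_model)  # main / reference / difference
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)  # lightweight text decoder
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_main, img_ref, token_ids):
        f_main = self.encoder(img_main)              # (B, N, d)
        f_ref = self.encoder(img_ref)                # (B, N, d)
        # Fusion that keeps both views and their element-wise difference,
        # so the decoder can attend to what changed between the two studies.
        fused = self.fuse(torch.cat([f_main, f_ref, f_main - f_ref], dim=-1))
        tgt = self.tok_emb(token_ids)                # question + answer-so-far tokens
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")),
                            diagonal=1)
        out = self.decoder(tgt, fused, tgt_mask=causal)  # cross-attend to fused image tokens
        return self.lm_head(out)                     # next-token logits

# Dummy forward pass: two 224x224 grayscale X-rays and a 12-token question.
model = DiffVQASketch()
logits = model(torch.randn(2, 1, 224, 224), torch.randn(2, 1, 224, 224),
               torch.randint(0, 1000, (2, 12)))
print(logits.shape)   # torch.Size([2, 12, 1000])
```

Concatenating the two token streams with their element-wise difference is only one plausible fusion strategy; cross-attention between the two images' tokens is another common choice, and the paper's enhanced fusion mechanism may differ from both.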
Primary Subject Area: Generative Models
Secondary Subject Area: Application: Radiology
Paper Type: Methodological Development
Registration Requirement: Yes
Visa & Travel: Yes
MIDL LaTeX Submission Checklist:
- Ensure no LaTeX errors during compilation.
- Created a single midl25_NNN.zip file with midl25_NNN.tex, midl25_NNN.bib, and all necessary figures and files.
- Includes \documentclass{midl}, \jmlryear{2025}, \jmlrworkshop, \jmlrvolume, \editors, and the correct \bibliography command.
- Did not override options of the hyperref package.
- Did not use the times package.
- Author and institution details are de-anonymized where needed. All author names, affiliations, and the paper title are correctly spelled and capitalized in the biography section.
- References must use the .bib file. Did not override the bibliographystyle defined in midl.cls. Did not use \begin{thebibliography} directly to insert references.
- Tables and figures do not overflow margins; avoided \scalebox; used \resizebox when needed.
- Included all necessary figures and removed *unused* files from the zip archive.
- Removed special formatting, visual annotations, and highlights used during rebuttal.
- All special characters in the paper and .bib file use LaTeX commands (e.g., \'e for é).
- Appendices and supplementary material are included in the same PDF after the references.
- Main paper does not exceed 9 pages; acknowledgements, references, and appendix start on page 10 or later.
LaTeX Code: zip
Copyright Form: pdf
Submission Number: 100