Abstract: The modern medical community seeks precise, multimodal interpretability: clinicians want models that explicitly connect image regions to diagnostic outcomes and reason in natural language. Large Multimodal Models (LMMs) are rapidly advancing open-domain vision-language reasoning, yet progress in medical visual question answering (Med-VQA) remains limited by two persistent bottlenecks: the scarcity of large-scale region-grounded supervision and the high cost of continuous radiologist oversight. We present an automated Chest X-ray Med-VQA generation-validation pipeline and a grounded Chest X-ray (CXR) dataset, GIV-CXR, built on top of the Chest ImaGenome dataset. The pipeline couples LMM-based question-answer generation with automated validation, scaling grounded data generation while preserving clinical reliability: prompts incorporating domain experts' insights constrain question-answer generation to clinically sound content, and Large Language Model (LLM) evaluators verify the reliability of the model-generated question-answer pairs. GIV-CXR is a large-scale dataset comprising 20,534 images from Chest ImaGenome, annotated with 81,257 bounding boxes, yielding 354,293 question-answer pairs. The generation prompts are designed strategically to elicit in-depth reasoning for effective grounding. Off-the-shelf LMMs underperformed on a sampled test set, highlighting their lack of grounding capabilities; after fine-tuning on our dataset, the models demonstrate significantly better reasoning and grounding, enhancing their interpretability. We will release the resources along with detailed instructions and ethical-use guidelines upon acceptance.
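For illustration only, a minimal sketch of the generation-validation loop the abstract describes. The function names (generate_qa, judge_qa), the prompt contents, and the acceptance check are hypothetical stand-ins, not the authors' actual implementation, which the abstract does not specify.

```python
# Minimal sketch of an LMM-generation / LLM-judge validation loop,
# assuming Chest ImaGenome-style (image, bounding box, finding) annotations.
# generate_qa and judge_qa are hypothetical stubs for the real model calls.

from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    bbox: tuple  # (x, y, w, h) region the question is grounded in

def generate_qa(image_id: str, bbox: tuple, finding: str) -> QAPair:
    """Hypothetical LMM call: prompt a multimodal model with the image
    region and its finding label to produce a grounded QA pair."""
    question = f"What abnormality is visible in the region {bbox}?"
    answer = f"The region shows {finding}."
    return QAPair(question, answer, bbox)

def judge_qa(qa: QAPair, finding: str) -> bool:
    """Hypothetical LLM-as-a-judge call: check the generated answer
    against the source annotation and reject inconsistent pairs."""
    return finding.lower() in qa.answer.lower()

def build_dataset(annotations):
    """Keep only QA pairs that pass validation, mirroring the
    generation-validation pipeline at a high level."""
    kept = []
    for image_id, bbox, finding in annotations:
        qa = generate_qa(image_id, bbox, finding)
        if judge_qa(qa, finding):
            kept.append(qa)
    return kept

if __name__ == "__main__":
    demo = [("img_001", (120, 80, 64, 64), "left lower lobe opacity")]
    print(build_dataset(demo))
```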
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Medical Visual Question Answering, Large Multimodal Models, Interpretability, Visual Grounding, LLM-as-a-judge
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 6
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 2
B2 Discuss The License For Artifacts: No
B2 Elaboration: We will discuss licensing with the dataset and resource release upon acceptance
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We will address intended use with the dataset and resource release upon acceptance
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We have used a public, open-source dataset
B5 Documentation Of Artifacts: No
B5 Elaboration: We will provide documentation with the dataset and resource release upon acceptance
B6 Statistics For Data: Yes
B6 Elaboration: Appendix of the paper
C Computational Experiments: Yes
C1 Model Size And Budget: N/A
C1 Elaboration: Some proprietary servers were used, for which details cannot be provided. We can give some details in the appendix upon acceptance
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: We discuss the models used and all hyperparameters
C3 Descriptive Statistics: Yes
C3 Elaboration: Appendix of the paper
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: N/A
D1 Elaboration: We can give some details in the appendix upon acceptance
D2 Recruitment And Payment: N/A
D2 Elaboration: That was an internal decision which cannot be shared. The annotators were domain experts, so the pay was adequate.
D3 Data Consent: N/A
D3 Elaboration: That was an internal decision which cannot be shared. The annotators were domain experts, so there was a formal agreement.
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Section 6
Author Submission Checklist: Yes
Submission Number: 1344