A Structured, Tagged, and Localized Visual Question Answering Dataset with Full Sentence Answers and Scene Graphs for Chest X-ray Images

07 Apr 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: VQA, Localization, Vision-Language Modeling, Medical Imaging, Chest X-Rays, Scene Graphs
TL;DR: We present a large-scale CXR VQA dataset derived from MIMIC-CXR with 42M QA pairs, featuring multi-part answers, bounding boxes, and structured tags; it was generated by combining LLM-based extraction from radiology reports with localization models.
Abstract: Visual Question Answering (VQA) enables targeted and context-dependent analysis of medical images, such as chest X-rays (CXRs). However, existing VQA datasets for CXRs are typically constrained by simplistic and brief answer formats, lack localization annotations (e.g., bounding boxes), and provide little metadata (e.g., region or radiological finding/disease tags). To address these limitations, we introduce MIMIC-Ext-CXR-QBA (abbr. CXR-QBA), a large-scale CXR VQA dataset derived from MIMIC-CXR, comprising 42 million QA pairs with multi-granular, multi-part answers, detailed bounding boxes, and structured tags. We automatically generated our VQA dataset from scene graphs (also made available), which we constructed using LLM-based information extraction from radiology reports. After automatic quality assessment, we identified 31M pre-training-grade and 7.5M fine-tuning-grade QA pairs, making this the largest and most sophisticated VQA dataset for CXRs to date. Tools for using our dataset and the construction pipeline are available at https://anonymous.4open.science/r/mimic-ext-cxr-qba/.
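
To make the answer structure concrete, the minimal Python sketch below shows what a single QA pair with a multi-part answer, a bounding box, and structured tags might look like. All field names and values here are hypothetical illustrations, not the released schema; consult the tools at the Code URL below for the actual format.

    import json

    # Hypothetical CXR-QBA record: field names are illustrative only and
    # do not reflect the released schema. It mirrors the features named
    # in the abstract: a multi-part answer, bounding boxes, and tags.
    qa_pair = {
        "study_id": "s12345678",  # MIMIC-CXR-style study ID (made up)
        "question": "Is there evidence of a pleural effusion?",
        "answer": {
            "short": "Yes.",
            "full": "There is a small left-sided pleural effusion.",
        },
        "boxes": [
            # Hypothetical pixel coordinates as [x_min, y_min, x_max, y_max].
            {"label": "pleural effusion", "bbox": [112, 640, 398, 880]},
        ],
        "tags": {
            "region": "left lower lung zone",
            "finding": "pleural effusion",
        },
        "quality_grade": "fine-tuning",  # vs. "pre-training", per the quality assessment
    }

    print(json.dumps(qa_pair, indent=2))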
Primary Area: AI/ML Datasets & Benchmarks for health sciences (e.g., climate, health, life sciences, physics, social sciences)
Croissant File: json
Dataset URL: https://drive.google.com/drive/folders/1U_A66GBzBRC6UuaT2tPzKXw4eJlE7xR_?usp=drive_link
Code URL: https://anonymous.4open.science/r/mimic-ext-cxr-qba
Submission Number: 54