HealMQA: A Healthcare Multimodal Question Answering Dataset for Benchmarking Large Language Models

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: Multimodal Question Answering, Consumer Healthcare, Large Language Models, Medical AI, Benchmarking, NLP
TL;DR: We introduce HealMQA, an expert-annotated multimodal dataset of real consumer healthcare questions spanning diverse healthcare topics, and benchmark it with state-of-the-art LLMs.
Abstract: Consumers increasingly rely on digital platforms to seek healthcare advice, yet existing medical question answering (QA) resources are primarily unimodal or professional-facing, limiting their relevance to real-world users. To address this gap, we introduce Consumer Healthcare Multimodal Question Answering (CHMQA), a novel task for generating clinically valid free-text answers to multimodal consumer health queries. To benchmark this task, we present HealMQA, an expert-annotated multimodal QA dataset, consisting of 1,022 real-world consumer questions paired with medically validated images and expert-written answers spanning 17 healthcare topics. We evaluate eight state-of-the-art large language models (LLMs) under zero-shot and few-shot prompting methods, complemented by automated metrics and human evaluation by licensed medical professionals. Our experiments reveal that while LLMs achieve strong performance in accuracy, clarity, and safety, they continue to struggle with effectively integrating the visual modality. These findings highlight both the promise and limitations of current systems, and position HealMQA as a foundation for advancing medically accurate and consumer-centered multimodal healthcare QA.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23141