DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

ACL ARR 2026 January Submission3212 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Question Answering, Benchmark, Dataset, Disaster Management, Factual Evaluation, Robustness
Abstract: Accurate question answering (QA) in disaster management requires navigating highly uncertain and conflicting information, yet language models are rarely evaluated under the specific constraints of this domain. Existing QA benchmarks often rely on clean and consistent evidence, which does not fully capture the extreme ambiguity and fragmentation characteristic of real-world crisis scenarios. We introduce DisastQA, a comprehensive benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. To assess robustness, we evaluate models under varying evidence settings, ranging from closed-book generation to noisy evidence integration, disentangling internal knowledge from reasoning capabilities. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol that emphasizes factual completeness while mitigating verbosity bias. An evaluation of 20 models reveals substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems, performance degrades markedly under realistic noise, exposing critical reliability gaps in disaster-response settings. All code, data, and evaluation scripts are publicly available at https://anonymous.4open.science/r/DisastQA-4490.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarks, datasets, evaluation metrics, question answering
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 3212