Abstract: Large Language Models (LLMs) have demonstrated strong performance on question answering (QA) tasks. However, Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging. Traditional QA settings often assume consistency across evidence, but MAQA can involve conflicting answers. Constructing datasets that reflect such conflicts is costly and labor-intensive, while existing benchmarks often rely on synthetic data, restrict the task to yes/no questions, or apply unverified automated annotation. To advance research in this area, we extend the conflict-aware MAQA setting to require models not only to identify all valid answers but also to detect specific conflicting answer pairs, if any. To support this task, we introduce a novel, cost-effective methodology that leverages fact-checking datasets to construct NatConfQA, a new benchmark for realistic, conflict-aware MAQA, enriched with detailed conflict labels for all answer pairs. We evaluate eight high-end LLMs on NatConfQA, revealing their fragility in handling various types of conflicts and the flawed strategies they employ to resolve them.