Keywords: question answering; open-ended QA; natural language inference; taxonomy; answer correctness; partial answers; overgeneration; contradiction; false premises; QA evaluation; LLM evaluation
Abstract: Reliable evaluation of open-ended question answering remains a bottleneck for measuring the factual competence of modern LLMs. Unlike multiple-choice tasks, free-form answers may be correct in many surface forms and may fail in qualitatively different ways, including incompleteness, contradiction, overgeneration, and acceptance of false premises. Existing judgment-based and similarity-based metrics often collapse these distinctions.
We address this gap with three reusable contributions. First, we introduce a fine-grained semantic correctness taxonomy that assigns Open-QA answers to eight ordered classes, separating verbose-but-correct answers from answers contaminated by hallucinated content. Second, we release CAP-Correctness, a 10k-example benchmark spanning widely used QA datasets, and CAP-Statements, an 11k-example dataset for converting QA pairs into declarative statements for NLI training and statement-based evaluation. Third, we introduce CAP, Context-Aware Precision, a reference-based metric that scores question-conditioned statements using bidirectional NLI.
Under a monotonicity protocol that tests whether metrics respect the taxonomy’s intended ordering, CAP outperforms established baselines.
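A monotonicity check of this kind can be illustrated with a minimal sketch. The abstract does not specify the protocol, so everything here is an assumption: the function name `adjacent_monotonicity` is hypothetical, the classes are assumed to be ordered from least to most correct, and "respecting the ordering" is taken to mean that the mean metric score does not decrease from one class to the next.

```python
from statistics import mean


def adjacent_monotonicity(scores_by_class):
    """Fraction of adjacent class pairs whose mean score is non-decreasing.

    scores_by_class: list of lists of metric scores, one inner list per
    taxonomy class, ordered from least to most correct (an assumption;
    the abstract does not define the protocol in detail).
    """
    if len(scores_by_class) < 2:
        raise ValueError("need at least two classes to compare")
    # One mean score per class, in taxonomy order.
    means = [mean(scores) for scores in scores_by_class]
    # Compare each class with the next one up in the ordering.
    pairs = list(zip(means, means[1:]))
    respected = sum(1 for lower, higher in pairs if higher >= lower)
    return respected / len(pairs)


# Toy scores for four classes (illustrative values, not results from the paper).
toy = [[0.1, 0.2], [0.3], [0.25], [0.9]]
print(adjacent_monotonicity(toy))
```

A metric that fully respects the ordering would score 1.0 under this sketch; the toy input has one inversion between its second and third classes, so it scores lower.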
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: evaluation methodologies, metrics, benchmarking, NLP datasets, open-domain QA, textual entailment, natural language inference, automatic evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 13682