How Correct Is Your Answer? A Semantic Correctness Framework for Open QA Evaluation

ACL ARR 2026 May Submission13682 Authors

26 May 2026 (modified: 17 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: question answering; open-ended QA; natural language inference; taxonomy; answer correctness; partial answers; overgeneration; contradiction; false premises; QA evaluation; LLM evaluation

Abstract: Reliable evaluation of open-ended question answering remains a bottleneck for measuring the factual competence of modern LLMs. Unlike multiple-choice tasks, free-form answers may be correct in many surface forms and may fail in qualitatively different ways, including incompleteness, contradiction, overgeneration, and acceptance of false premises. Existing judgment-based and similarity-based metrics often collapse these distinctions. We address this gap with three reusable contributions. First, we introduce a fine-grained semantic correctness taxonomy that assigns Open-QA answers to eight ordered classes, separating verbose-but-correct answers from answers contaminated by hallucinated content. Second, we release CAP-Correctness, a 10k-example benchmark spanning widely used QA datasets, and CAP-Statements, an 11k-example dataset for converting QA pairs into declarative statements for NLI training and statement-based evaluation. Third, we introduce CAP, Context-Aware Precision, a reference-based metric that scores question-conditioned statements using bidirectional NLI. Under a monotonicity protocol that tests whether metrics respect the taxonomy’s intended ordering, CAP outperforms established baselines.

Paper Type: Long

Research Area: Question Answering

Research Area Keywords: evaluation methodologies, metrics, benchmarking, NLP datasets, open-domain QA, textual entailment, natural language inference, automatic evaluation

Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Submission Number: 13682
