Keywords: unanswerable question, NLP datasets, metrics
TL;DR: We propose a novel framework for automatically generating the FactGuard-Bench dataset, improving LLMs' accuracy in distinguishing answerable from unanswerable questions in long-context reading comprehension.
Abstract: Large language models (LLMs) have demonstrated significant advances in reading comprehension. However, a persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. This issue becomes increasingly critical as the length of supported contexts continues to expand. To address this challenge, we propose a collaborative multi-task workflow, FactGuard, that automatically generates evidence-based question-answer pairs and systematically constructs unanswerable questions. Using this methodology, we built the FactGuard-Bench dataset, which comprises 25,220 examples covering both answerable and unanswerable question scenarios, with context lengths ranging from 4K to 128K. Experimental evaluations of nine popular LLMs reveal that all of them exhibit a significant performance gap between answerable and unanswerable questions, with even the most advanced models achieving only 67.67\% overall accuracy. After training on FactGuard-Bench, the model achieves an overall accuracy of 81.17\%, along with enhanced reasoning capabilities on unanswerable questions. Our code is publicly available at https://anonymous.4open.science/r/FACTGUARD-5BBC
Primary Area: datasets and benchmarks
Submission Number: 17376