Revealing The Intrinsic Ability of Generative Text Summarizers for Irrelevant Document Detection

ACL ARR 2024 June Submission2651 Authors

15 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In Retrieval-Augmented Generation (RAG), generative models are prone to performance degradation caused by retrieved irrelevant documents. Adding irrelevant documents to the training data and retraining language models incurs significant costs. Supervised models can detect irrelevant documents in the retrieved results and avoid retraining, but they cannot counter domain shifts in the real world. By introducing a method that emphasizes the unique features of infrequent words, we reveal the ability of the cross-attention mechanism to detect irrelevant documents within the inputs of generative models. We present CODE, a novel irrelevant document detector based on a closed-form expression over cross-attention scores. Our experimental results validate the superiority of CODE in both in-domain and cross-domain detection. For in-domain detection, CODE achieves a 5.80% FPR at 95% TPR, versus 30.3% for the supervised baseline, on T5-Large in the Delve domain. When irrelevant documents are sampled from out-of-domain data, the FPR of CODE decreases from 5.80% to 0.1%, while the FPR of the supervised baseline increases from 30.3% to 34.3%. For further insight, we highlight the importance of cross-attention, word-frequency normalization, and integrating in-domain irrelevant documents during pretraining.
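The abstract does not give CODE's closed-form expression, but the idea it describes — scoring a retrieved document by the cross-attention mass its tokens receive, re-weighted to emphasize infrequent words — can be sketched as follows. This is an illustrative assumption, not the paper's actual formula; the function names, the inverse-frequency weighting, and the threshold rule are all hypothetical.

```python
import numpy as np

def irrelevance_score(cross_attn, token_freqs, eps=1e-8):
    """Illustrative sketch (not the paper's CODE formula).

    cross_attn : (decoder_steps, input_tokens) cross-attention weights
                 from a generative summarizer over one retrieved document
    token_freqs: (input_tokens,) corpus frequencies of each input token

    Averages the attention each input token receives across decoder
    steps, then re-weights by inverse corpus frequency so that rare,
    content-bearing words dominate. A document whose rare tokens draw
    little attention scores low, suggesting irrelevance.
    """
    per_token = cross_attn.mean(axis=0)        # attention mass per input token
    weights = 1.0 / (token_freqs + eps)        # emphasize infrequent words
    weights = weights / weights.sum()          # normalize to a distribution
    return float(per_token @ weights)          # frequency-normalized attention mass

def detect_irrelevant(score, threshold):
    # Flag the document as irrelevant when its score falls below the threshold.
    return score < threshold
```

Because the weights form a distribution and attention values lie in [0, 1], the score is bounded in [0, 1]; the decision then reduces to a single threshold comparison, which matches the abstract's claim that no retraining of the generator is needed.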
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2651