Revealing The Intrinsic Ability of Generative Text Summarizers for Outlier Paragraph Detection

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Outlier Paragraph Detection, Generative Language Models, Cross Attention
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Generative text summarizers excel at content encapsulation but falter when outlier paragraphs disrupt the primary narrative. We categorize these outliers into two types: cross-document outliers, which are thematically inconsistent yet drawn from the same domain, and cross-domain outliers, which originate from distinct domains. Traditional detection methods lean on word embeddings and specialized classifiers, requiring extensive supervised fine-tuning. Confidence-based strategies, though they bypass fine-tuning, are ill-suited because summarization is not a classification task. Leveraging the encoder-decoder cross-attention framework, we introduce an approach that emphasizes the distinctive behavior of infrequent words for detection. We present CODE, a novel outlier detector built on a closed-form expression over cross-attention scores. Our experiments validate the superiority of CODE across datasets and architectures, e.g., achieving a 5.80% FPR at 95% TPR versus 25.63% for supervised baselines on T5-Large and the Delve domain. We further underscore the importance of cross-attention, word-frequency normalization, and the judicious inclusion of cross-document outliers during pretraining.
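The abstract does not state CODE's closed-form expression, so the sketch below is only a rough illustration of the general idea it describes: aggregate encoder-decoder cross-attention mass per source paragraph, normalize by word frequency so infrequent words count more, and flag paragraphs with anomalously low weighted mass. The function name, the `log1p` frequency normalization, and the per-paragraph averaging are all illustrative assumptions, not the paper's actual formula.

```python
import numpy as np

def paragraph_outlier_scores(cross_attn, token_paragraph_ids, token_freqs, n_paragraphs):
    """Hypothetical cross-attention-based outlier score (NOT the
    paper's CODE expression, which the abstract does not give).

    cross_attn: (n_decoder_steps, n_encoder_tokens) attention weights,
        e.g. averaged over layers and heads of an encoder-decoder summarizer.
    token_paragraph_ids: (n_encoder_tokens,) int paragraph index per source token.
    token_freqs: (n_encoder_tokens,) corpus frequency of each source token.
    """
    # Total attention each source token receives over all decoding steps.
    token_mass = cross_attn.sum(axis=0)
    # Frequency normalization (illustrative): up-weight infrequent words.
    weighted = token_mass / np.log1p(token_freqs)
    # Average weighted mass per paragraph; low values suggest the
    # summarizer ignored the paragraph, i.e. a candidate outlier.
    scores = np.zeros(n_paragraphs)
    counts = np.zeros(n_paragraphs)
    np.add.at(scores, token_paragraph_ids, weighted)
    np.add.at(counts, token_paragraph_ids, 1)
    return scores / np.maximum(counts, 1)
```

Under these assumptions, thresholding the scores (e.g., at the value giving 95% TPR on held-out inliers) yields the FPR-at-95%-TPR metric the abstract reports.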
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5653