Retrieval-Augmented Generation Inspired Long Document Classification

ACL ARR 2024 June Submission 1173 Authors

14 Jun 2024 (modified: 08 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Although proven effective across nearly all natural language processing tasks, transformer-based models have several well-known limitations. Arguably the most significant is the quadratic complexity, in both time and space, of the vanilla self-attention mechanism. As a result, most existing pre-trained language models, such as BERT, have a fixed maximum context window, which can create a mismatch between the context window size and the length of the data used for fine-tuning. This gives rise to the study of long document classification: the task of optimizing performance when the length of the input document exceeds the model's maximum token limit. Inspired by the retrieval-augmented generation techniques used with large language models in recent years, we propose a method that retrieves the most relevant sections of a long document, which are then fed into a traditional transformer-based model. Testing on four standard long document classification datasets, we show that our proposed method on average outperforms all baselines, including both transformer-based and non-transformer-based models.
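The abstract describes the method only at a high level. As a rough illustration of the retrieve-then-classify idea, the sketch below chunks a long document, scores the chunks against a query string with a sentence-embedding model, and feeds the top-ranked chunks to a fixed-window BERT classifier. The specific models, the fixed-size chunking, the query string, and all names here are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a RAG-inspired long document classifier.
# Model choices, chunk size, and the retrieval query are assumptions
# for illustration; the paper's exact method may differ.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification

retriever = SentenceTransformer("all-MiniLM-L6-v2")          # assumed retriever
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2                         # assumed label count
)

def classify_long_document(text: str, query: str, top_k: int = 4) -> int:
    # 1. Split the long document into fixed-size word chunks (simple heuristic).
    words = text.split()
    chunks = [" ".join(words[i:i + 100]) for i in range(0, len(words), 100)]

    # 2. Retrieve the chunks most relevant to the query by cosine similarity,
    #    mirroring the retrieval step of a RAG pipeline.
    chunk_emb = retriever.encode(chunks, convert_to_tensor=True)
    query_emb = retriever.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]
    top_idx = scores.topk(min(top_k, len(chunks))).indices.tolist()

    # 3. Concatenate the selected chunks in document order so the input fits
    #    the classifier's fixed context window, then classify.
    selected = " ".join(chunks[i] for i in sorted(top_idx))
    inputs = tokenizer(selected, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```

In practice the classifier head would be fine-tuned on the retrieved chunks rather than used off the shelf; the sketch only shows how retrieval lets a fixed-window model see the most relevant parts of a document that exceeds its token limit.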
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: financial/business NLP, legal NLP
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1173