In-Context Former: Lightning-fast Compressing Context for Large Language Model

ACL ARR 2024 June Submission 2331 Authors

15 Jun 2024 (modified: 10 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: With the rising popularity of Transformer-based large language models (LLMs), reducing their high inference costs has become a significant research focus. One effective way to mitigate these costs is to compress long input contexts. Existing methods typically leverage the self-attention mechanism of the target large model itself for context compression. While these methods have achieved notable results, the compression process still entails quadratic complexity. To overcome this limitation, we propose the In-Context Former (IC-Former). Rather than relying on the target large model, IC-Former uses cross-attention to extract and condense information from the contextual embeddings, so its computational overhead grows linearly with the compression range. Experimental results indicate that our method requires only 1/32 of the baseline's floating-point operations during compression and improves processing speed by 68 to 112 times, while achieving 90% of the baseline performance on evaluation metrics. Additionally, IC-Former exhibits strong regularity in its interactions with the context, enhancing its interpretability. Overall, IC-Former significantly reduces compression costs, making real-time compression scenarios feasible.
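
The abstract's key architectural point is that a fixed, small set of learnable query vectors cross-attends to the context embeddings, so cost scales linearly with context length rather than quadratically. The sketch below is a minimal illustration of that idea, not the authors' implementation; all names (ICFormerSketch, num_digest, hidden_size) are hypothetical assumptions for the example.

```python
import torch
import torch.nn as nn


class ICFormerSketch(nn.Module):
    """Toy cross-attention compressor: a fixed number of learnable 'digest'
    tokens attend to frozen context embeddings. Because the query count is
    constant, compute grows linearly with the context length."""

    def __init__(self, hidden_size: int = 768, num_digest: int = 128, num_heads: int = 8):
        super().__init__()
        # Learnable digest tokens; their count is independent of context length.
        self.digest = nn.Parameter(torch.randn(1, num_digest, hidden_size) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, context_embeds: torch.Tensor) -> torch.Tensor:
        # context_embeds: (batch, context_len, hidden_size) embeddings of the long context.
        # Returns (batch, num_digest, hidden_size) compressed vectors that would
        # be fed to the LLM in place of the full context.
        queries = self.digest.expand(context_embeds.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, context_embeds, context_embeds)
        return self.norm(queries + attended)


if __name__ == "__main__":
    compressor = ICFormerSketch()
    ctx = torch.randn(2, 4096, 768)   # e.g. a 4096-token context, batch of 2
    digests = compressor(ctx)
    print(digests.shape)              # torch.Size([2, 128, 768])
```

Note the design choice this sketch highlights: the quadratic term of self-attention over the full context is avoided because attention is only computed between the small digest set and the context, never context-to-context.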
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: sentence compression, efficient models, NLP in resource-constrained settings
Languages Studied: English
Submission Number: 2331