Research Area: Compute-efficient LMs
Keywords: transformers, KV cache selection, online summarization, submodular optimization, context summarization
TL;DR: A framework using a mixture of submodular functions for dynamic context (KV cache) summarization in LLMs, enabling efficient inference and infinite-context transformers.
Abstract: Transformer-based Large Language Models (LLMs) have shown tremendous advancements across various domains. However, their need to maintain key-value representations (a KV cache) of previously seen tokens in GPU memory leads to a significant memory overhead that scales linearly with sequence length and batch size. With the advent of extremely long-context LLMs, efficiently modeling long-range dependencies becomes challenging. In this work, we focus on the problem of long-context summarization by formulating it as a subset selection problem. Specifically, we propose a novel submodular optimization framework called BumbleBee that uses a mixture of submodular functions to balance the diversity among the context tokens in the key embedding space against their importance, computed as the accumulated attention they receive across different input tokens. Our framework applies to both the LLM prefill and decoding phases, using offline and online versions of our submodular algorithm, respectively. While the context size grows only as large as the summary size, the temporal extent of the context may grow unboundedly, justifying the moniker "Infinite-Context Transformers." Empirically, we validate the effectiveness of our framework across 13 different datasets using the LLaMA 7B and 13B models. Our results show that BumbleBee improves accuracy compared to state-of-the-art techniques at comparable context reduction ratios.
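To make the selection objective concrete, below is a minimal, hedged sketch of greedy maximization of a mixture of a facility-location (diversity) term over key embeddings and a modular importance term from accumulated attention. The function name, the mixture weight `lam`, and the use of cosine similarity are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_kv_tokens(keys, attn_importance, budget, lam=0.5):
    """Pick `budget` token indices to keep in the KV cache (illustrative sketch).

    keys:            (n, d) key embeddings of the cached tokens
    attn_importance: (n,) accumulated attention mass each token received
    budget:          number of tokens to keep (the summary size)
    lam:             assumed trade-off weight between diversity and importance
    """
    # Cosine similarity between key embeddings drives the diversity term.
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                               # (n, n)

    n = keys.shape[0]
    selected = []
    coverage = np.zeros(n)                            # best similarity of each token to the summary

    for _ in range(min(budget, n)):
        best_idx, best_gain = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Facility-location marginal gain: how much adding token i improves coverage.
            div_gain = np.maximum(sim[i], coverage).sum() - coverage.sum()
            gain = lam * div_gain + (1.0 - lam) * attn_importance[i]
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
        coverage = np.maximum(coverage, sim[best_idx])

    return sorted(selected)
```

Because the mixed objective (facility location plus a non-negative modular term) is monotone submodular, this greedy procedure enjoys the standard (1 - 1/e) approximation guarantee; an online variant would apply the same marginal-gain rule as new tokens stream in during decoding.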
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1316