KVTQ: Compressing the KV Cache to Hardware Efficient Ternary Digits by Fine-Grained Dynamic Quantization

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: compression, dynamic quantization, ternary digits, KV cache
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We show that the KV cache of LLMs can be quantized to ternary digits without significant loss of accuracy, yielding substantial computational and usability improvements.
Abstract: Large language models (LLMs) exhibit capabilities beyond expectations on a wide range of NLP tasks. Because LLM inference consumes enormous resources, optimizing the inference process is essential for broader deployment of LLMs. During text generation, caching the key-value embeddings (the KV cache) for subsequent decoding steps is a basic optimization, but the large size of the KV cache limits the inference batch size. Compressing the space occupied by the cached key-value embeddings allows larger inference batches and therefore higher throughput. Moreover, by analyzing how the KV cache is used, we find that compressing it to ternary digits not only shrinks its memory footprint but also greatly reduces the number of multiplication operations required in the attention block. Building on the numerical characteristics of the KV cache, we propose KVTQ, a method that compresses the KV cache to hardware-efficient ternary digits via fine-grained dynamic quantization. We validate KVTQ on several families of LLMs and conclude that compressing the KV cache to ultra-low bit widths can still preserve model quality.
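The abstract does not spell out the quantization procedure, so the following is only a minimal illustrative sketch of what fine-grained dynamic ternary quantization of a KV cache tensor could look like: each small group of values gets its own dynamically computed scale, values are snapped to {-1, 0, +1}, and dot products against ternary keys can then be formed with additions and subtractions plus one scale multiply per group. The function names, group size, mean-absolute scale, and 0.5 threshold are assumptions made for illustration, not the paper's exact scheme.

```python
import numpy as np

def ternary_quantize(kv, group_size=64):
    """Hypothetical fine-grained dynamic ternary quantization of a KV cache tensor.

    Each contiguous group of `group_size` values gets its own dynamic scale
    (here: the group's mean absolute value), and every value is mapped to
    {-1, 0, +1} by thresholding against a fraction of that scale.
    """
    flat = kv.reshape(-1, group_size)
    scale = np.abs(flat).mean(axis=1, keepdims=True)      # per-group dynamic scale
    threshold = 0.5 * scale                                # illustrative threshold choice
    ternary = np.where(flat > threshold, 1,
              np.where(flat < -threshold, -1, 0)).astype(np.int8)
    return ternary.reshape(kv.shape), scale

def ternary_dequantize(ternary, scale, group_size=64):
    """Reconstruct an approximate float tensor: each digit times its group scale."""
    flat = ternary.reshape(-1, group_size).astype(np.float32)
    return (flat * scale).reshape(ternary.shape)

def ternary_dot(query_group, ternary_key_group, scale):
    """Illustrative multiplication-free partial dot product for one group:
    add query entries where the key digit is +1, subtract where it is -1,
    then apply the single per-group scale at the end."""
    return scale * (query_group[ternary_key_group == 1].sum()
                    - query_group[ternary_key_group == -1].sum())

# Usage: quantize a fake KV cache slice of shape (heads, seq_len, head_dim).
kv = np.random.randn(8, 128, 64).astype(np.float32)
t, s = ternary_quantize(kv)
approx = ternary_dequantize(t, s)
print("mean abs reconstruction error:", np.abs(kv - approx).mean())
```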
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4580