Abstract: The increasing size and context length of large language models (LLMs) pose significant challenges for memory usage during inference, limiting their deployment on edge devices. Post-training quantization (PTQ) offers a promising solution by reducing memory requirements and improving computational efficiency, but aggressive PTQ methods often degrade model performance significantly. To address this, we propose LazyQuant, which leverages two key insights drawn from runtime information during LLM inference: (1) the precision of initial key-value (KV) cache segments strongly influences model performance, and (2) space for the KV cache can be allocated later during inference. Instead of relying on static, fully quantized weights, LazyQuant reduces weight size only when memory is tight, and it relies on previously generated KV caches, which were created with higher-precision weights, to mitigate precision loss. Our pilot experiments show that LazyQuant surpasses state-of-the-art methods under limited memory budgets.
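The memory accounting behind insight (2) can be illustrated with a small sketch. This is not the authors' algorithm: the `Budget` class, the function names, the candidate bit widths, and the fixed per-token KV cost are all hypothetical, chosen only to show how a growing KV cache can force weights to a lower precision later in inference rather than at load time.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Hypothetical memory model; all fields are illustrative assumptions."""
    total_bytes: int         # device memory available for weights + KV cache
    n_params: int            # number of weight parameters
    kv_bytes_per_token: int  # KV cache bytes added per generated token


def weight_bytes(n_params: int, bits: int) -> int:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits // 8


def pick_weight_bits(budget: Budget, n_tokens: int, bit_options=(16, 8, 4)):
    """Return the highest weight precision that fits next to the current KV cache.

    Returns None when even the lowest precision does not fit.
    """
    kv = n_tokens * budget.kv_bytes_per_token
    for bits in sorted(bit_options, reverse=True):
        if weight_bytes(budget.n_params, bits) + kv <= budget.total_bytes:
            return bits
    return None


def precision_schedule(budget: Budget, max_tokens: int, bit_options=(16, 8, 4)):
    """List the (token_index, bits) points at which the weight precision changes."""
    schedule = []
    current = object()  # sentinel so the first step is always recorded
    for t in range(max_tokens + 1):
        bits = pick_weight_bits(budget, t, bit_options)
        if bits != current:
            schedule.append((t, bits))
            current = bits
        if bits is None:  # out of memory even at the lowest precision
            break
    return schedule


# Toy numbers: 1000-byte budget, 400 parameters, 10 bytes of KV cache per token.
toy = Budget(total_bytes=1000, n_params=400, kv_bytes_per_token=10)
print(precision_schedule(toy, max_tokens=100))
```

With these toy numbers the weights stay at 16 bits for the first 20 tokens and drop only when the cache no longer fits. The sketch models memory accounting alone; it does not model how earlier KV entries, computed with higher-precision weights, affect output quality.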
External IDs:doi:10.1007/978-981-95-7078-2_48