Keywords: KV cache quantization, resource-adaptive inference, foundation model deployment, LLM safety, quantized inference, FP8 KV cache, alignment robustness, memory-safety tradeoffs, deployment-time auditing
TL;DR: KV cache quantization can silently break LLM safety alignment while preserving perplexity; we propose a training-free deployment audit of roughly 35 GPU-minutes that diagnoses and mitigates this failure.
Abstract: Key--value (KV) cache quantization is now a production default in LLM serving (vLLM, TensorRT-LLM, SGLang), yet standard quality metrics (perplexity, task accuracy, latency) have a blind spot: they cannot detect whether the model still refuses harmful requests after compression. We close that gap. Measuring eleven instruction-tuned models (3.8B--72B) on 1,894 prompts, we find that low-bit KV quantization can silently dismantle safety alignment. Mistral-7B sheds 15.2\% of its refusals at a perplexity ratio of $1.03\times$; the onset of collapse spans four bits across model families, so no universal safe bit-width exists. The same vulnerability shows up under vLLM with FP8 KV cache: the standard fp8_e5m2 format causes a 30.3\% conditional flip on Qwen-2.5-7B, roughly $150\times$ worse than simulated uniform 8-bit. We propose **Per-Channel Reduction** (PCR), a 20-prompt diagnostic that places each model into one of three failure modes (*outlier-crushes-safety*, *outlier-as-safety*, or *multi-layer dilution*) and prescribes a corresponding mitigation. PCR's directional predictions hold across six independent axes (held-out models, the KIVI quantizer, scheme transfer, layer-selection baselines, fresh prompts, system-prompt interventions), and the full audit fits inside a $\sim$35\,GPU-minute training-free protocol that recovers up to 97\% of lost alignment at 0--7\% memory overhead, enabling model-adaptive KV cache compression that replaces one-size-fits-all bit-width selection with geometry-informed per-layer precision.
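
For readers unfamiliar with the term, "simulated uniform quantization" of a KV tensor generally means rounding values to a fixed number of evenly spaced levels and immediately dequantizing them, so the rest of the model runs in full precision on degraded values. The following is a minimal NumPy sketch of that idea, not the pipeline used in this submission: the per-token asymmetric scheme, the function name `fake_quantize`, the synthetic tensor shape, and the injected outlier channel are all illustrative assumptions. It does not model FP8 formats such as fp8_e5m2, and it measures only reconstruction error, not refusal behaviour.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int, axis: int = -1) -> np.ndarray:
    """Simulated asymmetric uniform quantization: quantize, then dequantize.

    Hypothetical helper for illustration; one (min, max) range is computed
    per slice along `axis` (per token when axis is the head dimension).
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    # Guard against a zero range (constant slice) to avoid division by zero.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

# Synthetic stand-in for a key cache: (heads, tokens, head_dim).
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 128, 64)).astype(np.float32)
keys[..., 7] *= 20.0  # assumed outlier channel that stretches each token's range

# Mean squared reconstruction error at several bit-widths.
errors = {}
for bits in (8, 4, 2):
    deq = fake_quantize(keys, bits, axis=-1)
    errors[bits] = float(np.mean((deq - keys) ** 2))

for bits, err in errors.items():
    print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

In this toy setting a single large-magnitude channel widens every token's quantization range, so the remaining channels are represented with fewer effective levels. Whether such error affects refusals in a real model is exactly what reconstruction error alone cannot reveal.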
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 178