MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing

Zhaoyuan Su; Zeyu Zhang; Tingfeng Lan; Zirui Wang; Haiying Shen; Juncheng Yang; Yue Cheng

Published: 19 Mar 2026, Last Modified: 20 May 2026 | Venue: MLSys 2026 | License: CC BY 4.0

Keywords: LLM serving, dynamic mixed-precision inference, runtime layer swapping, elastic KV cache, workload-aware adaptation, SLO compliance

TL;DR: MorphServe is a workload-aware LLM serving framework that responds to real-time pressure by swapping quantized layers and elastically resizing KV cache capacity at runtime, reducing SLO violations and tail latency while preserving generation quality.

Abstract: Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. To address these issues, we present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which repurposes the freed memory to dynamically expand KV cache capacity. These mechanisms enable state-preserving transitions that jointly coordinate weight precision and KV capacity at runtime. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45% and improves P95 time to first token (TTFT) by 2.2×–3.9× over full-precision serving, without compromising generation quality. Compared to planning-based quantization methods, MorphServe reduces average accuracy degradation by 41.3%. Compared to KV cache compression, it lowers P95 TTFT by up to 2.4× while maintaining higher generation quality. These results establish MorphServe as a practical and elastic solution that effectively navigates the accuracy–efficiency Pareto frontier under dynamic LLM serving workloads.
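
The interaction between the two mechanisms named in the abstract can be illustrated with a toy model. The sketch below is not MorphServe's implementation: the class names, the per-layer `sensitivity` score, the `high_watermark` threshold, and the byte sizes are all hypothetical, and the logic runs synchronously whereas the abstract describes the mechanisms as asynchronous. It only shows the general idea of swapping the least impactful full-precision layer for a quantized one when the KV cache is under pressure, then turning the freed memory into extra KV cache blocks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    index: int
    full_bytes: int      # memory footprint at full precision
    quant_bytes: int     # footprint of the quantized alternative
    sensitivity: float   # hypothetical score: higher means quantizing hurts more
    quantized: bool = False


@dataclass
class ToyServer:
    layers: List[Layer]
    kv_block_bytes: int           # size of one KV cache block
    kv_blocks: int                # blocks currently allocated to the KV cache
    kv_blocks_in_use: int = 0
    high_watermark: float = 0.9   # hypothetical pressure threshold

    def kv_pressure(self) -> float:
        """Fraction of KV cache blocks currently occupied."""
        return self.kv_blocks_in_use / self.kv_blocks

    def step(self) -> None:
        """Run between decoding steps; react only when the KV cache is nearly full."""
        if self.kv_pressure() < self.high_watermark:
            return
        candidates = [layer for layer in self.layers if not layer.quantized]
        if not candidates:
            return  # every layer is already quantized; nothing left to free
        # Swap the layer whose quantization is assumed to hurt quality least.
        victim = min(candidates, key=lambda layer: layer.sensitivity)
        victim.quantized = True
        freed = victim.full_bytes - victim.quant_bytes
        # Repurpose the freed weight memory as additional KV cache blocks.
        self.kv_blocks += freed // self.kv_block_bytes


if __name__ == "__main__":
    layers = [Layer(i, 1000, 250, s) for i, s in enumerate([0.9, 0.1, 0.5, 0.7])]
    server = ToyServer(layers=layers, kv_block_bytes=100, kv_blocks=10, kv_blocks_in_use=9)
    server.step()
    print([layer.quantized for layer in layers], server.kv_blocks)
```

In this toy setup, one call to `step()` at 90% occupancy quantizes only the lowest-sensitivity layer and grows the cache, after which the pressure falls below the threshold and further calls leave the model unchanged.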

Topics: Model Serving: System optimizations for model serving, Resource Management: Auto-scaling and resource elasticity

Submission Number: 65
