HiMA: Efficient Hybrid Model Serving for Agentic Systems

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
Readers: Everyone
License: CC BY 4.0
Keywords: agent inference
TL;DR: HiMA improves agent inference efficiency by dynamically reallocating HBM between a hybrid model's KV and recurrent state pools at runtime via an admission-gated, page-grain CUDA VMM transfer.
Abstract: Hybrid models (e.g., Qwen3.5, Kimi Linear) are increasingly adopted for their strong capabilities in agentic applications, yet their two memory pools (a KV cache plus recurrent-state snapshots) break two assumptions of existing serving systems (e.g., SGLang, vLLM). First, per-pool LRU ignores the per-byte yield asymmetry between block types: one byte of snapshot saves a full chunked-scan replay, while one byte of KV saves only one block's prefill. Second, a static inter-pool partition wastes HBM as agentic workloads (agent swarms, long-horizon agents) shift the KV/recurrent mix at runtime. We introduce HiMA, which addresses the two issues with a unified intra-pool eviction order and a two-timescale inter-pool allocator. First, a loss-per-byte (LPB) score, defined as recovery time per evicted byte, merges KV blocks and recurrent snapshots into a single eviction queue. Second, HiMA runs two coupled convex programs on separated timescales: a slow Budgeter periodically re-balances per-pool HBM, and a fast per-arrival Admitter picks the cheapest of five candidates (own- or cross-pool admit, own- or cross-pool evict, or defer); both share a single shadow price (the marginal value of one HBM byte) so their decisions stay consistent. We implement HiMA as a plugin to SGLang; across diverse agent-serving workloads it delivers 7–94% output throughput gains and 25–64% P99 Time to First Token (TTFT) reductions. Our analysis shows the per-arrival rule tracks the offline optimum at second-order regret in transfer overhead and queue-depth noise.
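The loss-per-byte ordering described in the abstract can be illustrated with a minimal Python sketch. This is not HiMA's implementation: the `Block` class, the `evict_until` helper, and all sizes and recovery times below are hypothetical, chosen only to show how a single score (recovery time divided by size) lets KV blocks and recurrent snapshots share one eviction queue.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical cache entry; field names are illustrative assumptions."""
    block_id: str
    kind: str          # "kv" or "snapshot"
    size_bytes: int    # HBM freed if this block is evicted
    recovery_s: float  # assumed time to recompute the block after eviction

    @property
    def lpb(self) -> float:
        # Loss per byte: recovery time per evicted byte.
        return self.recovery_s / self.size_bytes


def evict_until(blocks, bytes_needed):
    """Pop blocks in ascending LPB order until enough bytes are freed."""
    # The index i breaks ties so Block objects are never compared directly.
    heap = [(b.lpb, i, b) for i, b in enumerate(blocks)]
    heapq.heapify(heap)
    victims, freed = [], 0
    while heap and freed < bytes_needed:
        _, _, block = heapq.heappop(heap)
        victims.append(block)
        freed += block.size_bytes
    return victims


# Made-up numbers: the snapshot is assumed far costlier to rebuild per byte.
blocks = [
    Block("kv0", "kv", 1 << 20, 0.002),
    Block("snap0", "snapshot", 1 << 20, 0.050),
    Block("kv1", "kv", 1 << 20, 0.004),
]
victims = evict_until(blocks, 2 << 20)
print([b.block_id for b in victims])
```

Under these assumed costs the two KV blocks are evicted before the snapshot, whereas a per-pool LRU would never compare the two block types at all.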
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 230