Keywords: Multi-tenant LLM serving, LLM inference, Latency attribution
TL;DR: LLMVisor provides a fast and accurate per-request latency attribution model for multi-tenant LLM serving, enabling fair scheduling and reliable accounting across diverse models, GPUs, and workloads.
Abstract: As LLM inference shifts to multi-tenant GPU clusters, co-batching improves throughput but obscures per-tenant usage and limits control. Enabling fractional sharing of the inference engine requires a real-time, per-request attribution primitive that is accurate and lightweight enough to run inside the scheduling loop. We present LLMVisor, a roofline-guided latency attribution model that captures both memory-bound and compute-bound phases via a concise piecewise-linear form over features proportional to FLOPs and memory I/O traffic. LLMVisor decomposes batch latency into additive, per-request shares and runs at microsecond (µs) scale. We evaluate LLMVisor across Llama3.1-8B and Qwen2.5-14B/32B on A100 and H100 GPUs under varying tensor parallelism and workload mixes. Compared to a token-count proxy baseline, LLMVisor attains near-perfect R² and reduces relative error by up to 2.5×/3.3× (p90/p99) for prefill and 3.5×/4.4× for decode, despite batching variability and sequence divergence.
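To make the abstract's attribution idea concrete, here is a minimal sketch (not the paper's implementation) of a roofline-guided, piecewise-linear per-request latency share. The feature names, coefficients, and the exact functional form (a max over compute-bound and memory-bound linear terms plus a fixed overhead share) are illustrative assumptions, not the fitted model from the paper.

```python
from dataclasses import dataclass

@dataclass
class RequestFeatures:
    flops: float      # feature proportional to the request's compute (FLOPs)
    mem_bytes: float  # feature proportional to the request's memory I/O traffic

def request_share(req: RequestFeatures,
                  a_compute: float, a_memory: float, bias: float) -> float:
    """Roofline-style piecewise-linear share: charge the request the larger of
    its compute-bound and memory-bound linear estimates, plus an overhead share."""
    compute_term = a_compute * req.flops
    memory_term = a_memory * req.mem_bytes
    return max(compute_term, memory_term) + bias

def attribute_batch(batch: list,
                    a_compute: float, a_memory: float, bias: float) -> list:
    """Decompose one co-batched step into additive, per-request latency shares."""
    return [request_share(r, a_compute, a_memory, bias) for r in batch]

if __name__ == "__main__":
    # Hypothetical fitted coefficients for one (model, GPU, tensor-parallelism) setup.
    batch = [RequestFeatures(flops=2.1e12, mem_bytes=3.0e9),
             RequestFeatures(flops=0.4e12, mem_bytes=5.5e9)]
    shares = attribute_batch(batch, a_compute=1.2e-14, a_memory=4.0e-12, bias=1e-4)
    print(shares, sum(shares))  # per-request shares and the implied batch latency
```

Because each share is a cheap closed-form expression of per-request features, summing the shares recovers an estimate of the batch latency while keeping the per-request breakdown available to the scheduler.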
Submission Number: 67