Keywords: Multi-tenant LLM serving, LLM inference, Latency attribution
TL;DR: LLMVisor provides a fast and accurate per-request latency attribution model for multi-tenant LLM serving, enabling fair scheduling and reliable accounting across diverse models, GPUs, and workloads.
Abstract: As LLM inference shifts to multi-tenant GPU clusters, co-batching improves throughput but obscures per-tenant usage and limits control. Enabling fractional sharing of the inference engine requires a real-time, per-request attribution primitive that is accurate and lightweight enough to run inside the scheduling loop. We present LLMVisor, a roofline-guided latency attribution model that captures both memory-bound and compute-bound phases via a concise piecewise-linear form over features proportional to FLOPs and memory I/O traffic. LLMVisor decomposes batch latency into additive, per-request shares and runs at microsecond (µs) scale. We evaluate LLMVisor across Llama3.1-8B and Qwen2.5-14B/32B on A100 and H100 GPUs under varying tensor parallelism and workload mixes. Compared to a token-count proxy baseline, LLMVisor attains near-perfect R² and reduces relative error by up to 2.5×/3.3× (p90/p99) for prefill and 3.5×/4.4× for decode, despite batching variability and sequence divergence.
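To make the abstract's attribution idea concrete, here is a minimal sketch (not the paper's implementation) of a roofline-guided, piecewise-linear per-request latency share. The feature names, coefficients, and the exact functional form (a max over compute-bound and memory-bound linear terms plus a fixed overhead share) are illustrative assumptions, not the fitted model from the paper.

```python
from dataclasses import dataclass

@dataclass
class RequestFeatures:
    flops: float      # feature proportional to the request's compute (FLOPs)
    mem_bytes: float  # feature proportional to the request's memory I/O traffic

def request_share(req: RequestFeatures,
                  a_compute: float, a_memory: float, bias: float) -> float:
    """Roofline-style piecewise-linear share: charge the request the larger of
    its compute-bound and memory-bound linear estimates, plus an overhead share."""
    compute_term = a_compute * req.flops
    memory_term = a_memory * req.mem_bytes
    return max(compute_term, memory_term) + bias

def attribute_batch(batch: list,
                    a_compute: float, a_memory: float, bias: float) -> list:
    """Decompose one co-batched step into additive, per-request latency shares."""
    return [request_share(r, a_compute, a_memory, bias) for r in batch]

if __name__ == "__main__":
    # Hypothetical fitted coefficients for one (model, GPU, tensor-parallelism) setup.
    batch = [RequestFeatures(flops=2.1e12, mem_bytes=3.0e9),
             RequestFeatures(flops=0.4e12, mem_bytes=5.5e9)]
    shares = attribute_batch(batch, a_compute=1.2e-14, a_memory=4.0e-12, bias=1e-4)
    print(shares, sum(shares))  # per-request shares and the implied batch latency
```

Because each share is a cheap closed-form expression of per-request features, summing the shares recovers an estimate of the batch latency while keeping the per-request breakdown available to the scheduler.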
Submission Number: 67