Keywords: LLM, vLLM, cold start latency, startup latency, performance characterization, benchmarking
TL;DR: We present the first systematic breakdown and modeling of vLLM cold-start latency, providing an interpretable predictive model for startup performance across models and hardware.
Abstract: As scalable inference services grow in popularity, the cold-start latency of the inference engine becomes increasingly important. vLLM has become the de facto inference engine for many inference workloads.
Despite this popularity, its complexity and rapid evolution have so far prevented a systematic study of its startup latency.
Following recent major architectural changes (e.g., the `V1` API and the introduction of `torch.compile`), this paper presents the first detailed performance characterization of vLLM startup latency. We break the startup process down into six foundational steps and demonstrate that it is predominantly CPU-bound. Each step exhibits consistent and interpretable scaling trends with respect to model- and system-level parameters, enabling fine-grained attribution of latency sources.
Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM's startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments.
All our benchmarking datasets, analysis tools, and prediction scripts are open-sourced at: https://github.com/upb-cn/vllm-startup-profiler
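As an illustration of what a lightweight, per-step analytical model could look like, the sketch below fits an independent linear trend to each startup step and sums the per-step predictions. This is a minimal sketch under stated assumptions, not the paper's actual model: the step names (`step_a`, `step_b`), the choice of a single linear driver per step, and all numbers are hypothetical placeholders, and the abstract does not specify the functional form used.

```python
from dataclasses import dataclass


@dataclass
class StepModel:
    """Linear latency model for one startup step (hypothetical form)."""
    name: str
    intercept: float  # fixed cost in seconds
    slope: float      # seconds per unit of the step's driver parameter

    def predict(self, x: float) -> float:
        return self.intercept + self.slope * x


def fit_step(name: str, xs: list[float], ys: list[float]) -> StepModel:
    """Ordinary least-squares fit of latency ys against one driver xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return StepModel(name, mean_y - slope * mean_x, slope)


def predict_startup(steps: list[StepModel], drivers: dict[str, float]) -> float:
    """Total startup latency as the sum of per-step predictions."""
    return sum(step.predict(drivers[step.name]) for step in steps)


if __name__ == "__main__":
    # Made-up measurements: (driver value, latency in seconds) per step.
    steps = [
        fit_step("step_a", [1.0, 2.0, 3.0], [2.5, 3.0, 3.5]),
        fit_step("step_b", [8.0, 16.0, 32.0], [1.0, 2.0, 4.0]),
    ]
    total = predict_startup(steps, {"step_a": 4.0, "step_b": 64.0})
    print(f"predicted startup latency: {total:.2f} s")
```

Keeping one small model per step, rather than a single regression on total latency, preserves the attribution the abstract emphasizes: each fitted coefficient can be read as the cost a specific parameter contributes to a specific step.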
Topics: Benchmarks, Datasets, and Evaluation: Benchmarks for training, inference, and efficiency; Benchmarks, Datasets, and Evaluation: Testing, debugging, monitoring, and reproducibility of ML applications; Model Serving: System optimizations for model serving
Submission Number: 72