Breaking the Ice: Analyzing Cold Start Latency in vLLM

Huzaifa Shaaban Kabakibo; Animesh Trivedi; Lin Wang

Published: 19 Mar 2026, Last Modified: 20 May 2026, Venue: MLSys 2026, License: CC BY 4.0

Keywords: LLM, vLLM, cold start latency, startup latency, performance characterization, benchmarking

TL;DR: We present the first systematic breakdown and modeling of vLLM cold-start latency, providing an interpretable predictive model for startup performance across models and hardware.

Abstract: As scalable inference services grow in popularity, the cold-start latency of the inference engine becomes increasingly important. vLLM has become the de facto inference engine for many inference workloads. Despite this popularity, its complexity and rapid evolution mean that the startup latency of its engine has not been studied systematically. In this paper, we present the first detailed performance characterization of vLLM startup latency, covering its recent major architectural changes (e.g., the `V1` API and the introduction of `torch.compile`). We break down the startup process into six foundational steps and demonstrate that this process is predominantly CPU-bound. Each step exhibits consistent and interpretable scaling trends with respect to model- and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM's startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All our benchmarking datasets, analysis tools, and prediction scripts are open-sourced at: https://github.com/upb-cn/vllm-startup-profiler

Topics: Benchmarks, Datasets, and Evaluation: Benchmarks for training, inference, and efficiency; Benchmarks, Datasets, and Evaluation: Testing, debugging, monitoring, and reproducibility of ML applications; Model Serving: System optimizations for model serving

Submission Number: 72
