Abstract: The rapid growth in the ubiquity and scale of LLMs places unprecedented demands on computing infrastructure. Meeting these demands requires not only substantial compute and memory resources but also significant energy, yielding large operational and embodied carbon emissions. In this work, we present three main observations based on modeling and traces from the production deployment of two Generative AI services at a major cloud service provider. First, while GPUs dominate operational carbon, host processing systems (e.g., CPUs, memory, storage) dominate embodied carbon. Second, offline batch inference accounts for a significant portion (up to 55\%) of serving capacity. Third, LLM inference exhibits heterogeneity at multiple levels, across both hardware and workloads. Based on these observations, we design EcoServe, a carbon-aware resource provisioning and scheduling framework for LLM serving systems, built on four principles: Reduce, Reuse, Rightsize, and Recycle (4R). Through a cross-stack ILP formulation and design, we demonstrate that EcoServe lowers carbon emissions by up to 47\%, compared to performance-, energy-, and cost-optimized design points, while maintaining performance targets and SLOs.
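For intuition on what a carbon-aware provisioning ILP can look like, the sketch below shows a toy version of the problem: choose how many instances of each hardware generation to provision so that aggregate throughput meets demand, while minimizing operational plus amortized embodied carbon. This is a minimal illustration, not EcoServe's actual cross-stack formulation; the hardware catalog, carbon coefficients, demand figure, and variable names are all hypothetical placeholders, and the solver here is PuLP's default.

```python
# Toy carbon-aware provisioning ILP (illustrative only; not the paper's model).
# All numbers below are hypothetical placeholders.
import pulp

# Hypothetical hardware catalog:
# tput = throughput (req/s), power = draw (kW),
# embodied = amortized embodied carbon (kgCO2e per hour)
hw = {
    "gpu_new": {"tput": 100, "power": 0.7, "embodied": 0.08},
    "gpu_old": {"tput": 60,  "power": 0.5, "embodied": 0.02},  # reused hardware: lower embodied
}
grid_intensity = 0.4  # kgCO2e per kWh (assumed)
demand = 500          # required aggregate throughput, req/s (assumed)

prob = pulp.LpProblem("carbon_aware_provisioning", pulp.LpMinimize)
n = {h: pulp.LpVariable(f"n_{h}", lowBound=0, cat="Integer") for h in hw}

# Objective: hourly operational carbon (power * grid intensity)
# plus amortized embodied carbon, summed over provisioned instances.
prob += pulp.lpSum(
    n[h] * (hw[h]["power"] * grid_intensity + hw[h]["embodied"]) for h in hw
)

# Constraint: provisioned capacity must cover demand (a crude SLO proxy).
prob += pulp.lpSum(n[h] * hw[h]["tput"] for h in hw) >= demand

prob.solve()
for h in hw:
    print(h, int(n[h].value()))
```

Under these placeholder numbers, the solver trades off the newer hardware's higher throughput against the reused hardware's lower embodied footprint, echoing the Reuse and Rightsize principles at a very small scale.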