Joint Cooling and Computing Optimization for Language Model Serving

Published: 30 Sept 2025, Last Modified: 24 Nov 2025, urbanai Poster, CC BY 4.0
Keywords: Optimization, LLM, LLM Serving
Abstract: AI data centers are now deployed at massive scale to serve power-hungry large language models (LLMs). The sheer volume of both computing and cooling inside these data centers raises growing concerns about LLMs' energy and emission impacts. While the energy efficiency of LLM inference has been studied recently, most prior work focuses on compute-side scheduling and optimization without explicitly accounting for thermal objectives or constraints. Because intensive GPU computing dissipates substantial heat, which in turn affects data center performance, this oversight can inadvertently increase overall energy consumption or reduce the efficiency of LLM serving. To address this gap, we propose a joint model of cooling and computing inside AI data centers, together with a novel hierarchical control framework that co-optimizes computing and thermal management by jointly tuning GPU parallelism, frequency (DVFS), and cooling control knobs. Using real Azure inference traces and detailed GPU profiling, our model balances serving latency with both energy efficiency and thermal requirements.
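To make the co-optimization concrete, below is a minimal illustrative sketch in Python, not the authors' implementation: it assumes discretized DVFS frequencies and cooling setpoints together with toy power, latency, and thermal models, and searches the joint space for the lowest-total-power setting that meets a latency SLO and a GPU temperature limit. All constants and model forms are placeholders for illustration only.

# Hypothetical sketch of joint cooling/computing co-optimization.
# Power, latency, and thermal models are illustrative placeholders,
# not the formulation used in the paper.

from itertools import product

# Candidate control knobs (assumed discretized for illustration).
GPU_FREQS_MHZ = [1000, 1200, 1400, 1600]      # DVFS settings
COOLING_SETPOINTS_C = [18, 21, 24, 27]        # supply-air temperature

LATENCY_SLO_S = 2.0        # per-request latency target (assumed)
GPU_TEMP_LIMIT_C = 83.0    # thermal throttling threshold (assumed)


def gpu_power_w(freq_mhz):
    # Toy model: static power plus dynamic power growing with frequency cubed.
    return 80.0 + 1.5e-7 * freq_mhz ** 3


def latency_s(freq_mhz):
    # Toy model: serving latency inversely proportional to GPU frequency.
    return 2500.0 / freq_mhz


def gpu_temp_c(freq_mhz, setpoint_c):
    # Toy steady-state thermal model: inlet temperature plus heating
    # proportional to dissipated power.
    return setpoint_c + 0.08 * gpu_power_w(freq_mhz)


def cooling_power_w(freq_mhz, setpoint_c):
    # Toy cooling model: removing the GPU heat is cheaper (higher COP)
    # at warmer supply-air setpoints.
    cop = 2.0 + 0.15 * (setpoint_c - 18)
    return gpu_power_w(freq_mhz) / cop


def co_optimize():
    """Exhaustively search the (frequency, setpoint) grid for the lowest
    total (compute + cooling) power that meets the latency SLO and the
    GPU temperature limit."""
    best = None
    for freq, setpoint in product(GPU_FREQS_MHZ, COOLING_SETPOINTS_C):
        if latency_s(freq) > LATENCY_SLO_S:
            continue
        if gpu_temp_c(freq, setpoint) > GPU_TEMP_LIMIT_C:
            continue
        total_w = gpu_power_w(freq) + cooling_power_w(freq, setpoint)
        if best is None or total_w < best[0]:
            best = (total_w, freq, setpoint)
    return best


if __name__ == "__main__":
    total_w, freq, setpoint = co_optimize()
    print(f"best: {freq} MHz at {setpoint} C setpoint, {total_w:.0f} W total")

A full framework along the lines the abstract describes would additionally treat GPU parallelism as a control knob and replace these toy models with ones fitted from GPU profiling and real inference traces.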
Submission Number: 58