Carbon-Aware RL-LLM Control for Energy-Efficient Liquid-Cooled HPC Data Centers

Published: 30 Oct 2025, Last Modified: 04 Nov 2025 · MLForSys 2025 · CC BY 4.0
Keywords: Liquid Cooling, RL-LLM Control, Sustainability, Data Center
TL;DR: Carbon-Aware RL-LLM Control for Energy-Efficient Liquid-Cooled HPC Data Centers
Abstract: The rapid growth of large language models (LLMs) in high-performance computing (HPC) data centers necessitates a shift from purely energy-efficient to carbon-aware control for liquid cooling systems. We introduce a novel multi-agent framework that leverages LLM-powered agents to achieve autonomous, carbon-aware thermal management. Our architecture features eight specialized agents coordinated via a hybrid Redis and Model Control Protocol (MCP) backbone for real-time operation. We validate our approach on a high-fidelity digital twin of the Frontier supercomputer's cooling system, focusing on a core contribution: a hybrid Reinforcement Learning (RL) and LLM control strategy. Experimental results show that our `RL $\rightarrow$ LLM` hybrid model significantly outperforms traditional baselines and other LLM configurations, achieving the lowest average blade temperatures (28.29°C) and the lowest carbon emissions (11.1 kg/hr), while maintaining operational stability. This work presents a practical blueprint for deploying agentic AI to create sustainable, efficient, and explainable control systems for complex cyber-physical infrastructure.
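The abstract's `RL $\rightarrow$ LLM` hybrid suggests a pipeline in which a trained RL policy proposes a cooling action and an LLM-powered supervisory agent refines it using carbon-intensity context. The paper does not specify this interface, so the sketch below is one plausible arrangement under stated assumptions: the class names, the proportional stand-in for the learned policy, and the carbon/temperature thresholds are all illustrative, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class PlantState:
    blade_temp_c: float      # current blade temperature (deg C)
    carbon_intensity: float  # grid carbon intensity (gCO2/kWh)

class RLPolicy:
    """Stand-in for a trained RL policy mapping state -> pump/valve setpoint in [0, 1].
    A real deployment would load a learned network here."""
    TARGET_C = 28.0  # illustrative thermal target, near the reported 28.29 C

    def propose(self, s: PlantState) -> float:
        # Simple proportional rule as a placeholder for the learned policy.
        return min(1.0, max(0.0, 0.5 + 0.1 * (s.blade_temp_c - self.TARGET_C)))

class LLMSupervisor:
    """Stand-in for the LLM agent: when the grid is carbon-intensive and there is
    thermal headroom, it eases cooling effort to cut emissions. A real agent would
    call a model and justify the adjustment, giving the explainability the paper
    highlights."""
    DIRTY_GRID = 400.0  # illustrative gCO2/kWh threshold
    SAFE_TEMP_C = 30.0  # illustrative thermal safety margin

    def refine(self, s: PlantState, action: float) -> float:
        if s.carbon_intensity > self.DIRTY_GRID and s.blade_temp_c < self.SAFE_TEMP_C:
            return max(0.0, action - 0.1)  # trade a little cooling for lower carbon
        return action

def control_step(s: PlantState, policy: RLPolicy, sup: LLMSupervisor) -> float:
    """One RL -> LLM control step: RL proposes, LLM supervisor refines."""
    return sup.refine(s, policy.propose(s))
```

On a dirty grid with thermal headroom (e.g. 29 C blades at 500 gCO2/kWh) the supervisor trims the RL setpoint; on a clean grid the RL action passes through unchanged, so the LLM layer only intervenes when carbon-awareness and safety both permit.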
Submission Number: 29