Carbon-Aware RL-LLM Control for Energy-Efficient Liquid-Cooled HPC Data Centers

Published: 30 Oct 2025, Last Modified: 04 Nov 2025 · MLForSys 2025 · CC BY 4.0
Keywords: Liquid Cooling, RL-LLM Control, Sustainability, Data Center
TL;DR: Carbon-Aware RL-LLM Control for Energy-Efficient Liquid-Cooled HPC Data Centers
Abstract: The rapid growth of large language models (LLMs) in high-performance computing (HPC) data centers necessitates a shift from purely energy-efficient to carbon-aware control for liquid cooling systems. We introduce a novel multi-agent framework that leverages LLM-powered agents to achieve autonomous, carbon-aware thermal management. Our architecture features eight specialized agents coordinated via a hybrid Redis and Model Control Protocol (MCP) backbone for real-time operation. We validate our approach on a high-fidelity digital twin of the Frontier supercomputer's cooling system, focusing on a core contribution: a hybrid Reinforcement Learning (RL) and LLM control strategy. Experimental results show that our `RL $\rightarrow$ LLM` hybrid model significantly outperforms traditional baselines and other LLM configurations, achieving the lowest average blade temperatures (28.29°C) and the lowest carbon emissions (11.1 kg/hr), while maintaining operational stability. This work presents a practical blueprint for deploying agentic AI to create sustainable, efficient, and explainable control systems for complex cyber-physical infrastructure.
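The abstract's `RL $\rightarrow$ LLM` hybrid suggests a pipeline in which a trained RL policy proposes a cooling action and an LLM-powered supervisory agent refines it using carbon-intensity context. The paper does not specify this interface, so the sketch below is one plausible arrangement under stated assumptions: the class names, the proportional stand-in for the learned policy, and the carbon/temperature thresholds are all illustrative, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class PlantState:
    blade_temp_c: float      # current blade temperature (deg C)
    carbon_intensity: float  # grid carbon intensity (gCO2/kWh)

class RLPolicy:
    """Stand-in for a trained RL policy mapping state -> pump/valve setpoint in [0, 1].
    A real deployment would load a learned network here."""
    TARGET_C = 28.0  # illustrative thermal target, near the reported 28.29 C

    def propose(self, s: PlantState) -> float:
        # Simple proportional rule as a placeholder for the learned policy.
        return min(1.0, max(0.0, 0.5 + 0.1 * (s.blade_temp_c - self.TARGET_C)))

class LLMSupervisor:
    """Stand-in for the LLM agent: when the grid is carbon-intensive and there is
    thermal headroom, it eases cooling effort to cut emissions. A real agent would
    call a model and justify the adjustment, giving the explainability the paper
    highlights."""
    DIRTY_GRID = 400.0  # illustrative gCO2/kWh threshold
    SAFE_TEMP_C = 30.0  # illustrative thermal safety margin

    def refine(self, s: PlantState, action: float) -> float:
        if s.carbon_intensity > self.DIRTY_GRID and s.blade_temp_c < self.SAFE_TEMP_C:
            return max(0.0, action - 0.1)  # trade a little cooling for lower carbon
        return action

def control_step(s: PlantState, policy: RLPolicy, sup: LLMSupervisor) -> float:
    """One RL -> LLM control step: RL proposes, LLM supervisor refines."""
    return sup.refine(s, policy.propose(s))
```

On a dirty grid with thermal headroom (e.g. 29 C blades at 500 gCO2/kWh) the supervisor trims the RL setpoint; on a clean grid the RL action passes through unchanged, so the LLM layer only intervenes when carbon-awareness and safety both permit.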
Submission Number: 29