Interpretable LLM Control for Sustainable Liquid Cooling in HPC Data Centers

Published: 01 Jul 2025, Last Modified: 08 Jul 2025 · CO-BUILD Poster · CC BY 4.0
Keywords: LLM in Sustainability, Sustainability, Interpretability, Data Centers, Liquid Cooling, LLM control, LLM-RL Hybrid Controller, Energy Efficiency, Real-Time Systems
TL;DR: We present an interpretable multi-agent system combining LLMs and RL to optimize liquid cooling in data centers for sustainability, achieving energy savings and improved reliability.
Abstract: The rise of AI workloads has driven the need for efficient liquid cooling in high-density data centers, yet current systems lack intelligent, interpretable control. We propose a novel framework combining Reinforcement Learning (RL) with Large Language Models (LLMs) to optimize end-to-end liquid cooling, from server cabinets to the cooling towers, while providing natural language explanations for control actions. Our approach is a hybrid controller that couples multi-agent Reinforcement Learning with a Large Language Model. Evaluated against a baseline on a scalable Modelica liquid-cooling model of Oak Ridge National Laboratory's Frontier supercomputer, it improves temperature stability and energy efficiency, offering a scalable and transparent solution for sustainable data center cooling.
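The abstract's hybrid design can be sketched in miniature: an RL policy maps observed coolant temperatures to a control setpoint, and a separate language layer renders each action as a natural-language rationale. Everything below is illustrative and hypothetical — the class and function names, the proportional heuristic standing in for a trained policy, and the template standing in for the LLM are assumptions, not the paper's implementation.

```python
class RLCoolingPolicy:
    """Toy stand-in for a trained multi-agent RL policy (hypothetical).

    Maps an observed coolant supply temperature (deg C) to a
    pump-speed setpoint in [0, 1].
    """

    def act(self, supply_temp_c: float, target_temp_c: float = 30.0) -> float:
        # Proportional-style heuristic in place of a learned policy:
        # hotter coolant -> higher pump speed, clipped to [0, 1].
        error = supply_temp_c - target_temp_c
        return max(0.0, min(1.0, 0.5 + 0.1 * error))


def explain_action(supply_temp_c: float, setpoint: float) -> str:
    """Stand-in for the LLM layer: turns a control action into a
    natural-language rationale (a fixed template here; the paper's
    system would query an LLM instead)."""
    direction = "raising" if setpoint > 0.5 else "holding or lowering"
    return (
        f"Coolant supply is {supply_temp_c:.1f} deg C, so the controller is "
        f"{direction} pump speed to {setpoint:.2f} of maximum to keep "
        f"temperatures stable while avoiding excess pumping energy."
    )


if __name__ == "__main__":
    policy = RLCoolingPolicy()
    for temp in (28.0, 33.0):
        setpoint = policy.act(temp)
        print(explain_action(temp, setpoint))
```

The point of the pairing is that every setpoint the (real) RL agent emits is accompanied by a human-readable justification, which is what makes the control loop auditable by data-center operators.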
Submission Number: 2