FINE-GRAINED ENERGY PREDICTION FOR PARALLELIZED LLM INFERENCE WITH PIE-P

ICLR 2026 Conference Submission 18308 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Energy Prediction, AllReduce, Tensor Parallelism, AllGather, Pipeline parallelism, Data Parallelism, LLMs, LLM Efficiency
TL;DR: We introduce PIE-P, an accurate, scalable, and fine-grained energy prediction framework for parallelized inference, addressing challenges of non-deterministic inter-GPU communication and synchronization overheads.
Abstract: With the widespread adoption of Large Language Models (LLMs), the energy cost of running them is quickly becoming a critical concern. However, precisely measuring the energy consumption of LLMs is often infeasible: hardware-based power monitors are not always accessible, and software-based energy measurement tools are not accurate. While various prediction techniques have been developed to estimate LLM energy consumption, these approaches are limited to single-GPU environments and thus do not apply to modern LLM inference, which is typically parallelized across multiple GPUs. In this work, we close this gap and introduce PIE-P, a fine-grained energy prediction framework for multi-GPU inference covering tensor, pipeline, and data parallelism. Predicting energy under parallelized inference is complicated by non-determinism in inter-GPU communication, additional communication overheads, and the difficulty of isolating energy consumed during the communication/synchronization phase. We develop a scalable prediction framework that addresses these issues via precise sampling, fine-grained modeling of inter-GPU communication, and careful accounting of parallelization overheads. Our evaluation shows that PIE-P yields accurate, fine-grained energy predictions across parallelism strategies, significantly outperforming baselines.
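The abstract's "precise sampling" and "careful accounting" can be illustrated with a minimal sketch: integrating sampled per-GPU power readings into per-phase energy, separating compute from communication windows. This is our own hypothetical illustration under assumed synthetic data, not PIE-P's actual implementation; the function name and trace are invented for the example.

```python
# Hypothetical sketch (not PIE-P's code): turn (time, power) samples into
# per-phase energy, the kind of accounting a fine-grained predictor needs
# to isolate compute energy from communication/synchronization energy.

def energy_joules(samples, phase):
    """Trapezoidal integration of (time_s, power_w) samples inside a phase window."""
    t0, t1 = phase
    pts = [(t, p) for t, p in samples if t0 <= t <= t1]
    return sum((pts[i + 1][0] - pts[i][0]) * (pts[i][1] + pts[i + 1][1]) / 2
               for i in range(len(pts) - 1))

# Synthetic 100 ms trace sampled every 10 ms: ~300 W while computing,
# dropping to ~120 W while the GPU idles in an AllReduce.
trace = [(i / 100, 300.0 if i < 6 else 120.0) for i in range(11)]
compute_energy = energy_joules(trace, (0.00, 0.05))  # compute phase
comm_energy = energy_joules(trace, (0.05, 0.10))     # communication phase
```

In practice the samples would come from a GPU power counter rather than a synthetic trace, and the phase windows from instrumented kernel/collective timestamps; the point is only that per-phase attribution reduces to integrating power over carefully chosen time windows.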
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 18308