Keywords: Energy-efficient LLMs; consumer GPU
TL;DR: LLM workloads alternating compute- and memory-bound ops can cause power spikes that throttle GPUs. Simple chunking smooths these spikes, improving performance and energy efficiency by up to 20% on edge devices and 1–2% on datacenter GPUs.
Abstract: Energy supply and heat dissipation are two of the main challenges with modern GPU deployments. While these are typically discussed in the context of new datacenter construction, the same constraints also apply to small form-factor consumer devices, such as the DGX Spark. In workloads that alternate compute-intensive tasks such as matmuls with memory-bound operations such as norms or cross-entropy, the compute-intensive parts might hit power and/or thermal limits and start throttling. In this short paper, we show that by _chunking_ the workload into smaller parts that alternate between compute and memory at a higher frequency, these power and temperature spikes can be smoothed out, preventing throttling and resulting in considerably faster wallclock time _and_ reduced total energy consumption. We present several scenarios in which this effect can be exploited on a DGX Spark with up to 20% performance and energy improvements, and demonstrate that the same phenomenon also occurs on less constrained systems, such as a multi-GPU server, albeit with a significantly smaller effect size of 1–2%.
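To make the chunking idea concrete, here is a minimal sketch of one workload of the kind the abstract describes: an output projection (matmul, compute-bound) followed by cross-entropy (memory-bound). It is an illustration under stated assumptions, not the implementation used in the paper: it runs in NumPy on the CPU, so it only shows how the two phases are interleaved and that the result is unchanged, not any power or thermal effect. The function names, tensor shapes, and `chunk_size` are hypothetical.

```python
import numpy as np


def cross_entropy_sum(logits, targets):
    """Summed cross-entropy over rows: elementwise ops and reductions (memory-bound phase)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_z = np.log(np.exp(shifted).sum(axis=1))
    picked = shifted[np.arange(len(targets)), targets]
    return float((log_z - picked).sum())


def loss_unchunked(hidden, weight, targets):
    """One long compute-bound phase followed by one long memory-bound phase."""
    logits = hidden @ weight
    return cross_entropy_sum(logits, targets) / len(targets)


def loss_chunked(hidden, weight, targets, chunk_size):
    """Same loss, but the two phases alternate once per chunk of rows."""
    total = 0.0
    for start in range(0, len(targets), chunk_size):
        stop = start + chunk_size
        logits = hidden[start:stop] @ weight                      # short compute-bound phase
        total += cross_entropy_sum(logits, targets[start:stop])  # short memory-bound phase
    return total / len(targets)


# Hypothetical sizes chosen only to keep the example fast.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((64, 16))
weight = rng.standard_normal((16, 100))
targets = rng.integers(0, 100, size=64)

full = loss_unchunked(hidden, weight, targets)
chunked = loss_chunked(hidden, weight, targets, chunk_size=8)
```

Both functions compute the same mean loss up to floating-point rounding; only the schedule differs. Whether a given `chunk_size` actually avoids throttling on real hardware would have to be measured on that device.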
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 216