Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use

ICLR 2026 Conference Submission 12808 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM inference efficiency, Quantization, Batching strategies, Serving infrastructure, Energy-aware AI, Latency modeling, Request scheduling, Sustainable deployment
TL;DR: LLM inference energy depends less on the model itself than on precision, batching, and serving. Quantization helps only in compute-bound phases, batching cuts per-token energy, and request shaping with TGI yields up to 100× efficiency gains.
Abstract: Large Language Models (LLMs) are increasingly deployed in production, shifting the computational and energy burden from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how \textbf{system-level design choices}, such as numerical precision, batching strategy, and request scheduling, can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face's Text Generation Inference server). Our results reveal that lower-precision formats yield energy gains only in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases like decoding; and that structured request timing (arrival shaping) can reduce per-request energy by up to $100\times$. We argue that sustainable LLM deployment depends not only on model internals, but also on the orchestration of the serving stack. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
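As a concrete illustration of the batching effect described in the abstract, the sketch below (not from the paper) measures per-token GPU energy at a few batch sizes using NVML's cumulative energy counter; the model id, prompts, batch sizes, and token budget are placeholder assumptions, and decode-phase energy per token is expected to fall as batching amortizes weight loads.

```python
# Minimal sketch (illustrative only): per-token decode energy vs. batch size
# on an NVML-capable GPU (e.g., H100). Model id and prompts are hypothetical.
import torch
import pynvml
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder, not the paper's model

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def gpu_energy_mj() -> int:
    # Cumulative GPU energy since driver load, in millijoules (Volta and newer).
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for batching
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "Explain why batching amortizes weight loads during decoding."
for batch_size in (1, 8, 32):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to("cuda")
    torch.cuda.synchronize()
    start_mj = gpu_energy_mj()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    joules = (gpu_energy_mj() - start_mj) / 1000.0
    new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * batch_size
    print(f"batch={batch_size:3d}  {joules / new_tokens * 1000:.1f} mJ/token")

pynvml.nvmlShutdown()
```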
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 12808