EfficientLLM: Evaluating Large Language Model Efficiency

ICLR 2026 Conference Submission 13611 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models (LLMs), Efficiency Benchmark, Architecture Pretraining
TL;DR: EfficientLLM introduces the first comprehensive benchmark systematically evaluating efficiency techniques for large language models across pretraining, fine-tuning, and bit-width quantization for inference.
Abstract: Large Language Models (LLMs) have achieved remarkable advances across reasoning, generation, and problem-solving, yet their scaling comes with prohibitive training, deployment, and environmental costs. Training frontier models like GPT-3 or PaLM consumes thousands of GPU/TPU days and millions of dollars. As these costs escalate, there is a pressing need for rigorous benchmarks that quantify efficiency–performance trade-offs. However, existing evaluations remain inadequate: 1) they rely on narrow metrics such as FLOPs or latency, neglecting complementary dimensions like memory, throughput, energy, and compression, leading to mischaracterized efficiency; 2) they are often limited to small models or a single hardware setup, making conclusions difficult to generalize to billion-parameter deployments across diverse accelerators; and 3) they fragment coverage across pretraining, fine-tuning, or inference, failing to provide an end-to-end perspective on the full lifecycle of model efficiency. To address these gaps, we present \textbf{EfficientLLM}, the first large-scale empirical benchmark that systematically quantifies efficiency–performance trade-offs across the entire lifecycle of LLMs. 1) To overcome the lack of multi-dimensional metrics, EfficientLLM unifies six orthogonal efficiency dimensions into a consistent evaluation framework. 2) To address scale and hardware diversity, we evaluate over 150 model–technique pairs spanning 0.5B–72B parameters on production-class clusters with 48×GH200, 8×H200, and 8×A100 accelerators, ensuring that conclusions generalize to realistic deployment conditions. 3) To provide end-to-end lifecycle coverage, EfficientLLM benchmarks architectural pretraining, fine-tuning, and bit-width quantization. By systematically resolving these three limitations, EfficientLLM establishes the most comprehensive benchmark to date for evaluating efficiency in large-scale models. Our results not only highlight critical trade-offs between accuracy, cost, and sustainability but also offer actionable guidance for both academic researchers and industrial practitioners in designing, training, and deploying the next generation of foundation models. All code and datasets are released as an open-source toolkit, accessible via \texttt{pip install efficientllm-toolkit}.
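To make the multi-dimensional measurement concrete, the following is a minimal PyTorch sketch of how three of the complementary dimensions named in the abstract (latency, throughput, and peak memory) can be recorded for a single forward pass. The model name, prompt, and batch size are illustrative assumptions, and the snippet does not depict the released efficientllm-toolkit API.

# Minimal sketch (illustrative; not the efficientllm-toolkit API): record
# latency, throughput, and peak GPU memory for one forward pass of a causal LM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # assumption: any small causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

# Illustrative batch of identical prompts.
batch = tokenizer(
    ["Explain efficiency-performance trade-offs."] * 8,
    return_tensors="pt", padding=True,
).to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model(**batch)  # single forward pass being measured
torch.cuda.synchronize()
latency_s = time.perf_counter() - start

tokens = batch["input_ids"].numel()
print(f"latency: {latency_s * 1e3:.1f} ms")
print(f"throughput: {tokens / latency_s:.0f} tokens/s")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")

Energy and compression, the remaining dimensions mentioned in the abstract, would require additional instrumentation (e.g., NVML power readings and pre/post-quantization model sizes) and are omitted here for brevity.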
Primary Area: datasets and benchmarks
Submission Number: 13611