EfficientLLM: Unified Pruning-Aware Pretraining for Auto-Designed Edge Language Models

20 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM pruning, efficient language models
Abstract: Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes. Increasing concerns about cloud cost, latency, and privacy have made the development of compact edge language models an urgent requirement. Distinguished from direct pretraining, which is bounded by the scaling law, this work proposes unified pruning-aware pretraining, focusing on retaining the performance of much larger optimized models. It features the following characteristics: 1) Data-Scalable Pruning: we introduce minimal parameter groups in the LLM and continuously optimize structural pruning, extending post-training pruning methods such as LLM-Pruner and SparseGPT into the pretraining phase. 2) Auto-Designed Architecture: the LLM architecture is auto-designed via saliency-driven pruning, which for the first time exceeds SoTA human-designed LLMs in modern pretraining. By unifying pruning, pretraining, and automatic architecture design in a single stage, this approach yields top-quality edge language models, termed EfficientLLM. EfficientLLM significantly outperforms SoTA baselines with $100\text{M} \sim 1\text{B}$ parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common-sense benchmarks. As a first attempt, EfficientLLM bridges the performance gap between post-training LLM compression and direct pretraining methods, and we fully open-source EfficientLLM for future advancements.
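
To make the idea of continuously optimized structural pruning over minimal parameter groups concrete, below is a minimal PyTorch sketch, not the paper's implementation. It assumes a first-order Taylor saliency score per group (in the spirit of LLM-Pruner) and treats each FFN hidden channel as a prunable group that is masked out during pretraining when its saliency falls below the rest; the names `PrunableFFN`, `group_mask`, and `prune_groups` are illustrative only.

```python
# Hypothetical sketch of saliency-driven structural pruning interleaved with pretraining.
# Assumption: saliency = sum of |w * dL/dw| over all weights attached to a group.
import torch
import torch.nn as nn

class PrunableFFN(nn.Module):
    """A feed-forward block whose hidden channels act as minimal parameter groups."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        # One mask entry per hidden channel (one structural group).
        self.register_buffer("group_mask", torch.ones(d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x)) * self.group_mask  # zero out pruned groups
        return self.down(h)

    @torch.no_grad()
    def group_saliency(self) -> torch.Tensor:
        # First-order Taylor saliency per hidden channel: |w * grad| summed over
        # the weights feeding into and out of that channel.
        s_up = (self.up.weight * self.up.weight.grad).abs().sum(dim=1)
        s_down = (self.down.weight * self.down.weight.grad).abs().sum(dim=0)
        return s_up + s_down

    @torch.no_grad()
    def prune_groups(self, num_to_prune: int) -> None:
        scores = self.group_saliency()
        scores[self.group_mask == 0] = float("inf")  # skip already-pruned groups
        drop = torch.topk(scores, num_to_prune, largest=False).indices
        self.group_mask[drop] = 0.0

# Usage: periodically remove the lowest-saliency groups during ordinary training steps.
model = PrunableFFN(d_model=64, d_hidden=256)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(100):
    x = torch.randn(8, 64)
    loss = model(x).pow(2).mean()      # stand-in for the language-modeling loss
    opt.zero_grad()
    loss.backward()
    if step % 20 == 19:                # pruning schedule chosen arbitrarily here
        model.prune_groups(num_to_prune=8)
    opt.step()
```

In this sketch the surviving mask pattern effectively defines the final architecture (how many channels each block keeps), which is one plausible reading of how saliency-driven pruning can double as automatic architecture design.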
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23174