Analytical Restructuring of Feed-Forward Networks for Accelerated LLM Inference

03 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Sparsity, LLM
Abstract: Scaling large language models (LLMs) improves performance but dramatically increases inference costs, with feed-forward networks (FFNs) consuming the majority of computational resources. While sparse architectures like mixture-of-experts (MoE) can mitigate this, inducing sparsity in existing dense models typically requires extensive, resource-intensive retraining (often hundreds of billions of tokens), creating a prohibitive barrier to practical deployment. We propose a broadly applicable post-training framework that improves this performance–cost trade-off by enabling rapid, analytical restructuring of FFNs into a sparse, efficient architecture. The framework analyzes neuron activation patterns on a small calibration dataset, then analytically rebuilds the FFN into an MoE-style architecture with always-active "shared" experts and conditionally activated "routed" experts. Critically, this process can restructure dense FFNs into sparse MoE architectures and can also be applied recursively to the experts of existing MoE models, creating finer-grained hierarchical sparsity for further acceleration. We construct a differentiable router directly from activation statistics, yielding a useful training-free baseline that can be deployed immediately and serves as a robust initialization for optional, lightweight fine-tuning. Experiments validate the approach across diverse settings, delivering practical speedups of up to $1.17\times$ in compute-bound scenarios and consistent gains across all configurations. This requires only minutes of processing time and minimal fine-tuning (2k samples), in contrast to methods that demand orders of magnitude more computational resources. By providing an efficient, analytical path to high-performance sparsity, the framework makes accelerated LLM deployment practical and accessible in resource-constrained environments.
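The abstract does not spell out the restructuring procedure, so the following is a minimal, hypothetical sketch of one way an activation-statistics-driven FFN split could look. It assumes a standard two-layer ReLU FFN (W_in, W_out); the hot/cold split by activation frequency, the k-means grouping of cold neurons, and the router initialization are all illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: split a dense two-layer ReLU FFN into an always-active
# "shared" expert plus conditionally activated "routed" experts, using only
# activation statistics from a small calibration set. Names, thresholds, and
# the clustering step are illustrative, not the paper's implementation.
import torch

def restructure_ffn(W_in, W_out, calib_acts, n_experts=4, hot_frac=0.25):
    """
    W_in:       (d_hidden, d_model) first FFN projection
    W_out:      (d_model, d_hidden) second FFN projection
    calib_acts: (n_tokens, d_hidden) post-ReLU activations on calibration data
    """
    # 1. Per-neuron firing frequency over the calibration set.
    freq = (calib_acts > 0).float().mean(dim=0)                # (d_hidden,)

    # 2. The most frequently firing ("hot") neurons form the shared expert.
    n_hot = int(hot_frac * freq.numel())
    hot = torch.topk(freq, n_hot).indices
    mask = torch.zeros(freq.numel(), dtype=torch.bool)
    mask[hot] = True
    cold = (~mask).nonzero(as_tuple=True)[0]
    shared = (W_in[hot], W_out[:, hot])

    # 3. Group "cold" neurons by activation co-occurrence (toy k-means over
    #    binarized firing patterns), so each group becomes a routed expert.
    patterns = (calib_acts[:, cold] > 0).float().T             # (n_cold, n_tokens)
    centroids = patterns[torch.randperm(patterns.shape[0])[:n_experts]]
    for _ in range(10):
        assign = torch.cdist(patterns, centroids).argmin(dim=1)
        for k in range(n_experts):
            if (assign == k).any():
                centroids[k] = patterns[assign == k].mean(dim=0)

    groups = [cold[assign == k] for k in range(n_experts)]
    experts = [(W_in[g], W_out[:, g]) for g in groups if len(g) > 0]

    # 4. A differentiable router initialized analytically: one row per expert,
    #    set to the mean input direction that excites that expert's neurons.
    router = torch.stack([W_in[g].mean(dim=0) for g in groups if len(g) > 0])
    return shared, experts, router

# Toy usage with random weights and synthetic calibration activations.
d_model, d_hidden, n_tokens = 64, 256, 512
W_in = torch.randn(d_hidden, d_model)
W_out = torch.randn(d_model, d_hidden)
calib = torch.relu(torch.randn(n_tokens, d_model) @ W_in.T)
shared, experts, router = restructure_ffn(W_in, W_out, calib)
```

The intuition this sketch encodes: neurons that fire for nearly every token are cheapest to keep always-on as a shared expert, while cold neurons that tend to co-activate are grouped so a router can skip entire groups per token.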
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 1404