SpikingLLM: Spiking Large Language Models with Causal Spiking Self-Attention and Spike-Form Knowledge Distillation
Keywords: Spiking Neural Networks, Energy Efficiency, Language Modeling
Abstract: Spiking Neural Networks (SNNs) offer a promising energy-efficient alternative to conventional large language models (LLMs) thanks to their event-driven nature and ultra-low power consumption. However, to retain representational capacity, most existing spiking LLM approaches rely on integer activations or softmax, which involve intensive floating-point operations and undermine inference efficiency. Moreover, the intrinsic spatial-temporal optimization of spiking networks further increases the cost and difficulty of direct training. To address these challenges, we propose \textbf{SpikingLLM}, the first fully binary spike-driven spiking LLM framework trained from random initialization, with no reliance on floating-point matrix multiplications or softmax. At the core of SpikingLLM is the \textbf{Causal Spiking Self-Attention (CSSA)} mechanism, which replaces conventional softmax with binary spike-based operations, enabling autoregressive language modeling in the spiking domain at low inference cost. To support cost-efficient training under constrained computational budgets, we further introduce \textbf{Spike-Form Knowledge Distillation (SKD)}, a multi-level distillation strategy that aligns the ANN teacher and the SNN student across embeddings, attention maps, intermediate features, and output logits. SKD allows SpikingLLM to achieve performance competitive with its ANN counterparts while using substantially fewer training tokens (e.g., 1.0B tokens for a 0.125B-parameter model and 10.0B tokens for a 1.3B-parameter model). As a result, SpikingLLM attains ANN-level performance at only \textbf{4.16\%–5.87\%} of the computational cost on natural language generation tasks. Our results demonstrate the feasibility and effectiveness of fully binary spike-driven LLMs and establish distillation as a promising pathway toward energy-efficient, brain-inspired spiking NLP.
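The abstract names two technical components; the sketches below are illustrative only. First, a minimal PyTorch sketch of a softmax-free, binary spike-driven causal attention in the spirit of CSSA. The Heaviside firing function, the threshold value, and the tensor shapes are our assumptions for illustration, not the authors' implementation (which would also need surrogate gradients for training).

import torch

def spike(x, threshold=1.0):
    # Heaviside firing produces binary spikes; a surrogate gradient
    # would replace the hard step during training (omitted here).
    return (x >= threshold).float()

def causal_spiking_self_attention(q, k, v, threshold=1.0):
    # q, k, v: binary spike tensors of shape (batch, seq, dim).
    # With 0/1 inputs, both matmuls reduce to accumulations
    # (additions), avoiding floating-point multiplications.
    T = q.size(1)
    scores = q @ k.transpose(-2, -1)           # integer spike-coincidence counts
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, 0.0)  # autoregressive (causal) mask
    attn = spike(scores, threshold)            # binary attention map, no softmax
    return attn @ v                            # accumulate values; the next neuron layer re-spikes

Second, a sketch of the multi-level alignment that SKD describes, combining losses over embeddings, attention maps, intermediate features, and output logits. The loss weights, temperature, and dictionary layout are hypothetical placeholders, not the paper's values.

import torch.nn.functional as F

def skd_loss(student, teacher, temperature=2.0,
             w_emb=1.0, w_attn=1.0, w_feat=1.0, w_logit=1.0):
    # `student` / `teacher`: dicts holding 'emb', lists 'attn' and
    # 'feat' (one entry per aligned layer), and 'logits', all
    # collected during a forward pass.
    loss = w_emb * F.mse_loss(student['emb'], teacher['emb'])
    for s, t in zip(student['attn'], teacher['attn']):
        loss = loss + w_attn * F.mse_loss(s, t)
    for s, t in zip(student['feat'], teacher['feat']):
        loss = loss + w_feat * F.mse_loss(s, t)
    # Hinton-style soft-label distillation on the output logits.
    log_p_s = F.log_softmax(student['logits'] / temperature, dim=-1)
    log_p_t = F.log_softmax(teacher['logits'] / temperature, dim=-1)
    loss = loss + w_logit * (temperature ** 2) * F.kl_div(
        log_p_s, log_p_t, log_target=True, reduction='batchmean')
    return loss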
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7591