SpikingLLM: A Conversion-Based Method with Window Inhibition Mechanism for Spiking Large Language Models

Submitted to ICLR 2026 on 11 Sept 2025 (modified: 11 Feb 2026). License: CC BY 4.0
Keywords: Large Language Models; Spiking Neural Networks; Quantized Large Language Models
Abstract: Recent advances in large language models (LLMs) have enabled unprecedented capabilities in real-world applications, yet reducing their energy consumption remains challenging. In this paper, we aim to improve the energy efficiency of LLMs by leveraging the advantages of brain-inspired spiking neural networks (SNNs). We propose SpikingLLM, a novel approach that equivalently converts quantized large language models (QLLMs) built on PrefixQuant* into their fully-spiking counterparts, in which every operator runs in a more efficient spiking form. To ensure that every operator can be converted into its spiking version, we introduce two techniques: ① QK2Head-migration post-softmax quantization, which significantly improves the performance of existing QLLMs under post-softmax quantization; ② differential-based methods, which handle SNN-unfriendly operators such as the KV cache. To further reduce energy consumption, we introduce a window inhibition mechanism that effectively mitigates the over-firing issue in ST-BIF+ neurons and improves spike sparsity. With these approaches, SpikingLLM substantially reduces energy consumption while achieving state-of-the-art performance on both perplexity and common-sense reasoning tasks.
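The abstract does not spell out the neuron dynamics or the exact form of the window inhibition rule. The sketch below is a minimal, assumption-laden illustration of one plausible reading: a bidirectional integrate-and-fire neuron (in the spirit of ST-BIF+, which emits +1/−1 spikes so that the running spike count tracks a quantized activation level), extended with a hypothetical inhibition window that suppresses opposite-sign spikes for a few timesteps after each spike, discouraging +1/−1 oscillations ("over-firing") that cancel out and waste energy. The class name `STBIFNeuron`, the `window` parameter, and the suppression rule are illustrative assumptions, not the paper's actual mechanism.

```python
import torch


class STBIFNeuron:
    """Sketch of an ST-BIF-style bidirectional integrate-and-fire neuron.

    The `window` argument implements a *hypothetical* window-inhibition rule:
    after a spike is emitted, spikes of the opposite sign are blocked for
    `window` timesteps (an assumption for illustration only).
    """

    def __init__(self, threshold: float, num_levels: int, window: int = 2):
        self.threshold = threshold    # firing threshold / quantization step
        self.num_levels = num_levels  # maximum spike count (quantization levels)
        self.window = window          # inhibition window length (hypothetical)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """inputs: [T, ...] input currents over T timesteps; returns spikes [T, ...]."""
        T = inputs.shape[0]
        v = torch.zeros_like(inputs[0])          # membrane potential
        count = torch.zeros_like(inputs[0])      # running spike count (tracer)
        inhibit = torch.zeros_like(inputs[0])    # timesteps left in inhibition window
        last_sign = torch.zeros_like(inputs[0])  # sign of the most recent spike
        spikes = torch.zeros_like(inputs)

        for t in range(T):
            v = v + inputs[t]

            # Candidate positive/negative spikes, bounded by the quantization range.
            pos = (v >= self.threshold) & (count < self.num_levels)
            neg = (v < 0) & (count > 0)

            # Window inhibition (assumed form): while the window is open, block
            # spikes whose sign opposes the last emitted spike.
            blocked = (inhibit > 0) & ((pos & (last_sign < 0)) | (neg & (last_sign > 0)))
            pos = pos & ~blocked
            neg = neg & ~blocked

            s = pos.float() - neg.float()
            spikes[t] = s
            v = v - s * self.threshold
            count = count + s

            fired = s != 0
            last_sign = torch.where(fired, s, last_sign)
            inhibit = torch.where(
                fired,
                torch.full_like(inhibit, float(self.window)),
                torch.clamp(inhibit - 1, min=0),
            )
        return spikes


# Usage example: 8 timesteps, 16 neurons.
spike_train = STBIFNeuron(threshold=1.0, num_levels=4).forward(torch.randn(8, 16))
```

With `window=0` the neuron reduces to a plain bidirectional integrate-and-fire unit; larger windows trade a small approximation error for fewer cancelling spike pairs and hence higher sparsity, which is the kind of energy/accuracy trade-off the abstract alludes to.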
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3920