How to Get Spiking LLMs? A Dual ANN-to-SNN Conversion with Layer-Wise Calibration

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: calibration, spiking neural network, large language model, ann-to-snn conversion
Abstract: With rising concerns about data privacy, deploying large language models (LLMs) on edge devices rather than relying solely on cloud-based solutions is becoming increasingly essential. Nonetheless, the constrained power and computational capacity of edge hardware frequently hinder the practical deployment of LLMs. Spiking Neural Networks (SNNs) have gained attention as a viable alternative, offering brain-inspired efficiency and low power consumption, making them ideal for edge deployment. Among various SNN training strategies, ANN-to-SNN conversion stands out for its relatively low computational cost compared to training spiking networks from scratch. However, conventional conversion methods still require a specially trained, conversion-friendly ANN, which becomes prohibitively expensive when applied to large-scale models like LLMs. To address this limitation, we propose a novel ANN-to-SNN conversion framework that can be regarded as a dual version of conventional conversion methods. Built on quantized LLMs, our approach eliminates the need to train a dedicated ANN tailored for conversion. A key challenge in such conversions is the temporal dynamics of spike arrivals—commonly known as unevenness error—which can cause significant performance degradation. To mitigate this issue, we introduce a parameter-efficient, layer-wise calibration technique that effectively reduces conversion errors, particularly unevenness error, while keeping computational overhead minimal. Theoretical analysis demonstrates that our calibration method substantially lowers the final conversion error between the original LLM and its spiking counterpart. Extensive experiments on LLaMA models show that our method achieves performance comparable to state-of-the-art quantization techniques, highlighting its effectiveness.
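The abstract builds on the standard premise of ANN-to-SNN conversion: a rate-coded integrate-and-fire (IF) neuron observed over T timesteps reproduces a clipped, T-level quantized activation, which is why quantized LLMs are a natural starting point. The sketch below illustrates only this classical equivalence (reset-by-subtraction IF dynamics with a half-threshold initial potential), not the paper's dual conversion scheme or its layer-wise calibration; the threshold `THETA` and timestep count `T` are illustrative choices.

```python
import numpy as np

THETA, T = 1.0, 8  # hypothetical firing threshold and simulation length

def quantized_relu(x, theta=THETA, steps=T):
    """ANN side: clipped ReLU quantized to `steps` levels."""
    return np.clip(np.floor(x * steps / theta + 0.5), 0, steps) * theta / steps

def if_spike_rate(x, theta=THETA, steps=T):
    """SNN side: IF neuron with reset-by-subtraction and constant input x.

    The half-threshold initial potential centers the rounding, so the
    scaled spike count matches quantized_relu exactly for constant input.
    """
    v, spikes = theta / 2, 0
    for _ in range(steps):
        v += x
        if v >= theta:      # fire at most once per timestep
            spikes += 1
            v -= theta      # subtract threshold instead of hard reset
    return spikes * theta / steps
```

Under constant (evenly arriving) input the two sides agree exactly; the unevenness error the abstract targets arises precisely when upstream spikes arrive non-uniformly in time, breaking this equivalence in deeper layers.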
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15417