SpikeCLIP: A contrastive language-image pretrained spiking neural network

Published: 01 Jan 2025, Last Modified: 01 Aug 2025 · Neural Networks 2025 · CC BY-SA 4.0
Abstract:

Highlights
• Spiking-Based Multimodal Feature Alignment: This work is among the first to demonstrate that multimodal features extracted from text and images can be effectively aligned using spike train representations. These aligned representations enable zero-shot prediction of concept categories for previously unseen inputs.
• Novel Training Algorithm for Multimodal SNNs: We propose a two-step method for training multimodal SNNs: pre-training for cross-modal alignment via knowledge distillation, followed by dual-loss fine-tuning with surrogate gradients (see the sketch after this list).
• Comprehensive Experimental Evaluation: We conduct extensive experiments to assess the performance of SpikeCLIP on image classification tasks, and we perform ablation studies to demonstrate the model's zero-shot capability and the reduction in energy consumption it achieves.
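To make the two-step recipe in the second highlight concrete, the following is a minimal PyTorch sketch of how such a pipeline could look. Everything here is an illustrative assumption rather than the authors' implementation: the rectangular surrogate gradient, the rate-coded spiking encoder, the MSE distillation loss against a frozen CLIP teacher, and the InfoNCE contrastive loss used as one of the two fine-tuning losses are all hypothetical stand-ins for the components named in the highlights.

```python
# Hypothetical sketch of SpikeCLIP-style training; module names, thresholds,
# and losses are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate gradient
    in the backward pass (the standard trick for training SNNs end to end)."""

    @staticmethod
    def forward(ctx, membrane):
        ctx.save_for_backward(membrane)
        return (membrane > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane,) = ctx.saved_tensors
        # Pass gradients only near the firing threshold (window 0.5 assumed).
        surrogate = (membrane.abs() < 0.5).float()
        return grad_output * surrogate


class SpikingEncoder(nn.Module):
    """Toy spiking encoder: integrates its input over T time steps, emits
    spikes, and averages them into a rate-coded embedding."""

    def __init__(self, in_dim, embed_dim, time_steps=4):
        super().__init__()
        self.fc = nn.Linear(in_dim, embed_dim)
        self.time_steps = time_steps

    def forward(self, x):
        membrane = torch.zeros(x.size(0), self.fc.out_features, device=x.device)
        spikes = []
        for _ in range(self.time_steps):
            membrane = membrane + self.fc(x)
            spike = SurrogateSpike.apply(membrane - 1.0)  # threshold 1.0 assumed
            membrane = membrane - spike                   # soft reset after firing
            spikes.append(spike)
        return torch.stack(spikes).mean(0)  # spike-rate embedding


def distillation_loss(student_feat, teacher_feat):
    """Step 1: align SNN features with a frozen CLIP teacher's features."""
    return F.mse_loss(F.normalize(student_feat, dim=-1),
                      F.normalize(teacher_feat, dim=-1))


def contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """Step 2 (one of the two fine-tuning losses): CLIP-style symmetric
    InfoNCE over a batch of matched image/text pairs."""
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under the same assumptions, zero-shot prediction as described in the first highlight would then amount to a cosine-similarity lookup: encode an unseen image into a spike-rate embedding, compare it against the embeddings of text prompts for each candidate class, and pick the most similar class.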