SpikeCLIP: A contrastive language-image pretrained spiking neural network

Published: 01 Jan 2025, Last Modified: 01 Aug 2025 · Neural Networks 2025 · CC BY-SA 4.0
Abstract:

Highlights
• Spiking-Based Multimodal Feature Alignment: This work is among the first to demonstrate that multimodal features extracted from text and images can be effectively aligned using spike train representations. These aligned representations enable zero-shot prediction of concept categories for previously unseen inputs.
• Novel Training Algorithm for Multimodal SNNs: We propose a two-step method for training multimodal SNNs: pre-training for cross-modal alignment via knowledge distillation, followed by dual-loss fine-tuning with surrogate gradients (see the sketch after this list).
• Comprehensive Experimental Evaluation: We conduct extensive experiments to assess the performance of SpikeCLIP on image classification tasks, and we perform ablation studies to demonstrate the model's zero-shot capability and the reduction in energy consumption it achieves.
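To make the two-step recipe in the second highlight concrete, the following is a minimal PyTorch sketch of how such a pipeline could look. Everything here is an illustrative assumption rather than the authors' implementation: the rectangular surrogate gradient, the rate-coded spiking encoder, the MSE distillation loss against a frozen CLIP teacher, and the InfoNCE contrastive loss used as one of the two fine-tuning losses are all hypothetical stand-ins for the components named in the highlights.

```python
# Hypothetical sketch of SpikeCLIP-style training; module names, thresholds,
# and losses are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate gradient
    in the backward pass (the standard trick for training SNNs end to end)."""

    @staticmethod
    def forward(ctx, membrane):
        ctx.save_for_backward(membrane)
        return (membrane > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane,) = ctx.saved_tensors
        # Pass gradients only near the firing threshold (window 0.5 assumed).
        surrogate = (membrane.abs() < 0.5).float()
        return grad_output * surrogate


class SpikingEncoder(nn.Module):
    """Toy spiking encoder: integrates its input over T time steps, emits
    spikes, and averages them into a rate-coded embedding."""

    def __init__(self, in_dim, embed_dim, time_steps=4):
        super().__init__()
        self.fc = nn.Linear(in_dim, embed_dim)
        self.time_steps = time_steps

    def forward(self, x):
        membrane = torch.zeros(x.size(0), self.fc.out_features, device=x.device)
        spikes = []
        for _ in range(self.time_steps):
            membrane = membrane + self.fc(x)
            spike = SurrogateSpike.apply(membrane - 1.0)  # threshold 1.0 assumed
            membrane = membrane - spike                   # soft reset after firing
            spikes.append(spike)
        return torch.stack(spikes).mean(0)  # spike-rate embedding


def distillation_loss(student_feat, teacher_feat):
    """Step 1: align SNN features with a frozen CLIP teacher's features."""
    return F.mse_loss(F.normalize(student_feat, dim=-1),
                      F.normalize(teacher_feat, dim=-1))


def contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """Step 2 (one of the two fine-tuning losses): CLIP-style symmetric
    InfoNCE over a batch of matched image/text pairs."""
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under the same assumptions, zero-shot prediction as described in the first highlight would then amount to a cosine-similarity lookup: encode an unseen image into a spike-rate embedding, compare it against the embeddings of text prompts for each candidate class, and pick the most similar class.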