Towards Energy-efficient Audio-visual Classification via Multimodal Interactive Spiking Neural Network

Published: 28 Jan 2025, Last Modified: 28 Jan 2026, TOMM, Everyone, CC BY 4.0
Abstract: The Audio-visual Classification (AVC) task aims to determine video categories by integrating audio and visual signals. Traditional AVC methods rely on Artificial Neural Networks (ANNs) that operate on floating-point features, which entails large parameter counts and high energy consumption. Recent research has shifted towards brain-inspired Spiking Neural Networks (SNNs), which transmit audio-visual information through sparse binary (0/1) spike features and thereby achieve better energy efficiency. However, this sparsity makes it harder to encode and exploit the spike features effectively. Moreover, because spike firing depends on the neuron membrane potential, the heterogeneous distributions of the different modalities in the AVC task cause spikes to be activated asynchronously, resulting in cross-modal asynchronization. Prior SNN models often overlook this issue and consequently achieve lower classification accuracy than traditional ANN models. To address these challenges, we present a new Multimodal Interaction Spiking Network (MISNet), the first to successfully balance accuracy and efficiency for the AVC task. At the core of MISNet, we propose a Multimodal Leaky Integrate-and-fire (MLIF) neuron, which coordinates and synchronizes the spike activations of audio-visual signals within a single neuron, in contrast to the prior SNN paradigm that relies on multiple separate processing neurons. As a result, MISNet generates audio and visual spiking features with effective cross-modal fusion. Additionally, we propose adding extra loss regularizations before fusing the obtained audio-visual features for final classification, thereby benefiting unimodal spiking learning for multimodal interaction. We evaluate our method on five audio-visual datasets, demonstrating advanced performance in both accuracy and energy consumption.
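The cross-modal asynchronization described in the abstract can be illustrated with a minimal sketch of a generic, single-modality leaky integrate-and-fire neuron. This is not the MLIF neuron proposed in the paper, whose formulation is not given here. The function name `lif_spikes`, the parameters `tau`, `v_threshold`, and `v_reset`, the hard-reset rule, and the constant input values are all assumptions chosen for illustration.

```python
def lif_spikes(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one discrete-time LIF neuron and return its 0/1 spike train.

    Generic textbook-style LIF with a hard reset (an assumption for this
    sketch, not the paper's MLIF neuron).
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: the potential decays toward v_reset
        # while accumulating the current input.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


# Hypothetical constant input currents standing in for two modalities
# whose features have different magnitudes.
audio_input = [1.2] * 8
visual_input = [1.8] * 8

audio_spikes = lif_spikes(audio_input)
visual_spikes = lif_spikes(visual_input)

# Time steps at which both neurons fire together.
coincident = [t for t, (a, v) in enumerate(zip(audio_spikes, visual_spikes)) if a and v]

print("audio :", audio_spikes)
print("visual:", visual_spikes)
print("coincident steps:", coincident)
```

With identical neuron parameters, the two inputs cross the firing threshold at different rates, so the separate neurons rarely spike at the same time step. This is the kind of misalignment that arises when each modality is processed by its own neuron.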