Multi-modal spiking tensor regression network for audio-visual zero-shot learning

Published: 01 Jan 2025, Last Modified: 21 Jul 2025 · Neurocomputing 2025 · CC BY-SA 4.0
Abstract: Recently, convolutional neural networks have received significant attention, particularly in the field of audio-visual zero-shot learning. They accurately perceive and capture local features, which allows the model to effectively obtain the corresponding attributes. However, the original multilinear structure is disrupted when the tensor is flattened as it passes through fully connected layers. Motivated by this, we introduce a multi-modal spiking tensor regression network (MSTR). MSTR combines tensor regression networks based on tensor contractions with spiking neural networks featuring threshold adjustments, thereby handling temporal and spatial information effectively. It facilitates fine-grained feature extraction while retaining high-dimensional spatial information. Specifically, we use Spiking Neural Networks (SNN) to encode temporal features and Tensor Regression Networks (TRN) to encode spatial features. Our proposed Temporal-Spatial-Semantic Fusion block combines the temporal, spatial, and semantic features of each modality. Finally, the fused audio and visual features pass through a series of cross-modal transformers, further exploring the relationship between the two modalities. Experimental results on three benchmark datasets, ActivityNet, VGGSound, and UCF, show that MSTR outperforms state-of-the-art models, with harmonic mean (HM) score improvements of 6.0%, 6.8%, and 2.2% on the three datasets, respectively. The code and pre-trained models are available at https://github.com/xia-zhe/MSTR.
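
To make the pipeline described in the abstract more concrete, below is a minimal, illustrative sketch of its high-level structure: a spiking (leaky integrate-and-fire) encoder for temporal features, a tensor-contraction head that avoids flattening the spatial feature map, a fusion layer that concatenates temporal, spatial, and semantic features per modality, and a cross-modal attention block. All module names, tensor shapes, thresholds, and hyperparameters here are assumptions chosen for illustration; they are not taken from the paper or its released code (see the GitHub link above for the authors' implementation).

```python
# Minimal sketch of an MSTR-like pipeline (PyTorch). All details are assumed.
import torch
import torch.nn as nn


class SpikingTemporalEncoder(nn.Module):
    """Toy leaky integrate-and-fire encoder over a sequence of frame features."""

    def __init__(self, dim, threshold=1.0, decay=0.9):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.threshold = threshold  # firing threshold (assumed fixed here; the paper adjusts it)
        self.decay = decay          # membrane-potential leak factor

    def forward(self, x):  # x: (batch, time, dim)
        membrane = torch.zeros_like(x[:, 0])
        spikes = []
        for t in range(x.size(1)):
            membrane = self.decay * membrane + self.proj(x[:, t])
            fired = (membrane >= self.threshold).float()
            membrane = membrane * (1.0 - fired)  # reset potential where a spike fired
            spikes.append(fired)
        return torch.stack(spikes, dim=1).mean(dim=1)  # rate-coded summary: (batch, dim)


class TensorRegressionSpatialEncoder(nn.Module):
    """Toy tensor-contraction head that maps a (H, W, C) feature map to a vector
    without flattening it, preserving the multilinear structure."""

    def __init__(self, h, w, c, out_dim):
        super().__init__()
        # One low-rank factor per tensor mode, contracted jointly below.
        self.fh = nn.Parameter(torch.randn(h, out_dim) * 0.02)
        self.fw = nn.Parameter(torch.randn(w, out_dim) * 0.02)
        self.fc = nn.Parameter(torch.randn(c, out_dim) * 0.02)

    def forward(self, x):  # x: (batch, h, w, c)
        return torch.einsum("bhwc,hd,wd,cd->bd", x, self.fh, self.fw, self.fc)


class CrossModalBlock(nn.Module):
    """Audio attends to video and vice versa via multi-head cross-attention."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, v):  # a, v: (batch, 1, dim)
        a2, _ = self.attn(a, v, v)
        v2, _ = self.attn(v, a, a)
        return a + a2, v + v2


if __name__ == "__main__":
    B, T, H, W, C, D = 2, 8, 7, 7, 32, 64
    temporal = SpikingTemporalEncoder(D)
    spatial = TensorRegressionSpatialEncoder(H, W, C, D)
    fuse = nn.Linear(3 * D, D)  # temporal + spatial + semantic fusion per modality
    cross = CrossModalBlock(D)

    frames_a, frames_v = torch.randn(B, T, D), torch.randn(B, T, D)   # audio/visual frame features
    maps_a, maps_v = torch.randn(B, H, W, C), torch.randn(B, H, W, C)  # audio/visual spatial maps
    sem = torch.randn(B, D)                                            # class/attribute embedding

    fa = fuse(torch.cat([temporal(frames_a), spatial(maps_a), sem], dim=-1)).unsqueeze(1)
    fv = fuse(torch.cat([temporal(frames_v), spatial(maps_v), sem], dim=-1)).unsqueeze(1)
    fa, fv = cross(fa, fv)
    print(fa.shape, fv.shape)  # torch.Size([2, 1, 64]) for each modality
```

The key design point this sketch illustrates is the spatial head: instead of flattening the feature map before a fully connected layer, the einsum contracts each tensor mode against its own factor, which is what allows the multilinear structure to be retained.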