Temporal spatial semantic fusion network for audio-visual zero-shot learning

Published: 28 May 2025, Last Modified: 05 Jun 2025. Venue: Intelligent Data Analysis. License: CC BY 4.0
Abstract: Due to the scarcity of annotated real-world data for specific categories, audio-visual generalized zero-shot learning (GZSL) has attracted significant attention. GZSL aims to classify novel classes absent during training while maintaining stable performance on seen classes. However, most existing methods model audio-visual cues only implicitly and fail to fully exploit temporal, spatial, and semantic consistency. To address these challenges, we propose the Temporal Spatial Semantic Fusion network (TSSF). Specifically, we process both audio and visual modalities with a multi-branch, multi-grained structure comprising a temporal global extraction module, a spatial local refinement module, and a multi-grained fusion module. The temporal global extraction module employs a Transformer-based Spiking Neural Network to extract explicit temporal representations and capture global dependencies. In parallel, the spatial local refinement module focuses on spatial information and local details using a window attention mechanism. The temporal and spatial features are then hierarchically fused in the multi-grained fusion module, which incorporates both temporal and spatial attention for semantic enrichment. To model multi-modal interactions, we enhance the audio and visual features through cross-modal attention and align the resulting multi-modal representations with text embeddings. Experiments on three benchmark audio-visual datasets validate the superiority of our method over state-of-the-art approaches. Notably, TSSF achieves improvements of 30.34% and 6.96% in the HM and ZSL metrics, respectively, on the VGG-GZSL dataset.
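To make the described pipeline concrete, the following is a minimal PyTorch sketch of the processing flow outlined in the abstract: per-modality global temporal modelling, window-based local refinement, gated multi-grained fusion, cross-modal attention, and similarity-based alignment with text embeddings. All module choices and names here (feature dimensions, a standard `nn.TransformerEncoder` as a stand-in for the spiking Transformer, the simple gating used for fusion, etc.) are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Minimal sketch of a TSSF-style audio-visual zero-shot pipeline (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalGlobalExtractor(nn.Module):
    """Global temporal modelling (stand-in for the Transformer-based SNN)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                     # x: (B, T, D)
        return self.encoder(x)                # (B, T, D)


class SpatialLocalRefiner(nn.Module):
    """Local refinement via attention within non-overlapping windows."""
    def __init__(self, dim: int, window: int = 4, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: (B, T, D), T divisible by window
        b, t, d = x.shape
        w = x.reshape(b * t // self.window, self.window, d)
        out, _ = self.attn(w, w, w)           # attention restricted to each window
        return out.reshape(b, t, d)


class MultiGrainedFusion(nn.Module):
    """Fuse global (temporal) and local (spatial) branches with learned gates."""
    def __init__(self, dim: int):
        super().__init__()
        self.temporal_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, g, l):                  # both (B, T, D)
        fused = self.temporal_gate(g) * g + self.spatial_gate(l) * l
        return fused.mean(dim=1)              # temporal pooling -> (B, D)


class TSSFSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.a_global, self.v_global = TemporalGlobalExtractor(dim), TemporalGlobalExtractor(dim)
        self.a_local, self.v_local = SpatialLocalRefiner(dim), SpatialLocalRefiner(dim)
        self.a_fuse, self.v_fuse = MultiGrainedFusion(dim), MultiGrainedFusion(dim)
        self.cross = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.proj = nn.Linear(dim, dim)       # projection into the text-embedding space

    def forward(self, audio, video, text_emb):
        # audio/video: (B, T, D) clip features; text_emb: (C, D) class embeddings.
        a = self.a_fuse(self.a_global(audio), self.a_local(audio))
        v = self.v_fuse(self.v_global(video), self.v_local(video))
        # Cross-modal attention: each modality attends to the other.
        av, _ = self.cross(a.unsqueeze(1), v.unsqueeze(1), v.unsqueeze(1))
        va, _ = self.cross(v.unsqueeze(1), a.unsqueeze(1), a.unsqueeze(1))
        joint = self.proj(av.squeeze(1) + va.squeeze(1))
        # Zero-shot classification: cosine similarity with class text embeddings.
        return F.normalize(joint, dim=-1) @ F.normalize(text_emb, dim=-1).t()


if __name__ == "__main__":
    model = TSSFSketch()
    scores = model(torch.randn(2, 8, 256), torch.randn(2, 8, 256), torch.randn(10, 256))
    print(scores.shape)                       # torch.Size([2, 10]) -> class scores
```

At inference time, the predicted class is the text embedding with the highest similarity score; restricting the candidate set to unseen classes yields the ZSL setting, while scoring against the union of seen and unseen classes corresponds to GZSL.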