${\text{CA}^{2}\text{ST}}$: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition

Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi

Published: 2026, Last Modified: 11 May 2026. IEEE Trans. Pattern Anal. Mach. Intell. 2026. License: CC BY-SA 4.0

Abstract: We propose Cross-Attention in Audio, Space, and Time ($\text{CA}^{2}\text{ST}$), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that uses only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics (EPIC-KITCHENS-100, Something-Something-V2, Kinetics-400, ActivityNet, and HD-EPIC) and show that it performs in a balanced way across them. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, EPIC-SOUNDS, and HD-EPIC-SOUNDS. CAVA shows favorable performance on these datasets, demonstrating effective information exchange among multiple experts within the B-CA module. In addition, $\text{CA}^{2}\text{ST}$ combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
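The exchange between experts described in the abstract can be pictured with a small, generic sketch of cross-attention through a low-dimensional bottleneck. This is an illustration only and not the authors' implementation: the abstract does not specify how B-CA is built, so the single-head design, the shared key/value projection, the residual connection, and every name below (`bottleneck_cross_attention`, `w_down_q`, `w_down_c`, `w_up`, the token counts and dimensions) are assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for learned weights or token features."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def bottleneck_cross_attention(queries, context, w_down_q, w_down_c, w_up):
    """Hypothetical single-head cross-attention through a bottleneck.

    queries: (n_q, d) tokens of the expert being updated.
    context: (n_c, d) tokens of the other expert.
    Returns (n_q, d): queries plus an update attended from context.
    """
    q = matmul(queries, w_down_q)    # down-project queries: (n_q, r)
    kv = matmul(context, w_down_c)   # down-project context: (n_c, r), used as keys and values
    r = len(q[0])
    kv_t = [list(col) for col in zip(*kv)]           # transpose: (r, n_c)
    scores = matmul(q, kv_t)                         # (n_q, n_c)
    weights = [softmax([s / math.sqrt(r) for s in row]) for row in scores]
    attended = matmul(weights, kv)                   # (n_q, r)
    update = matmul(attended, w_up)                  # up-project: (n_q, d)
    # Residual connection: each expert keeps its own features and adds the exchange.
    return [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(queries, update)]


rng = random.Random(0)
d, r = 8, 2                                  # feature width and bottleneck width (assumed)
spatial = rand_matrix(4, d, rng, scale=1.0)  # 4 spatial-expert tokens (toy data)
temporal = rand_matrix(6, d, rng, scale=1.0) # 6 temporal-expert tokens (toy data)

# Separate toy weights for each direction of the exchange.
s_from_t = [rand_matrix(d, r, rng), rand_matrix(d, r, rng), rand_matrix(r, d, rng)]
t_from_s = [rand_matrix(d, r, rng), rand_matrix(d, r, rng), rand_matrix(r, d, rng)]

spatial_out = bottleneck_cross_attention(spatial, temporal, *s_from_t)
temporal_out = bottleneck_cross_attention(temporal, spatial, *t_from_s)
```

In this sketch each expert queries the other's tokens, so information flows in both directions while the attention itself runs at the smaller width `r`. An audio expert, as in CAVA, could be attached the same way by passing its tokens as `context`; that extension is likewise an assumption about the general pattern rather than a description of the paper's design.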

External IDs: dblp:journals/pami/LeeCLC26