Auxiliary audio-textual modalities for better action recognition on vision-specific annotated videos

Published: 01 Jan 2024, Last Modified: 05 Nov 2024. Venue: Pattern Recognit. 2024. License: CC BY-SA 4.0
Abstract: Highlights
• Introduced a novel framework for improved human activity recognition using audio–visual data.
• Employed pre-trained language models to bridge audio and video datasets.
• Developed a learnable mechanism to selectively ignore irrelevant audio modalities.
• Proposed an efficient video Transformer that processes visual data with fewer parameters.
• Achieved superior performance on benchmark datasets, outperforming existing methods.