Auxiliary audio-textual modalities for better action recognition on vision-specific annotated videos
Highlights
• Introduced a novel framework for improved human activity recognition using audio–visual data.
• Employed pre-trained language models to bridge audio and video datasets.
• Developed a learnable mechanism to selectively ignore irrelevant audio modalities (see the sketch below).
• Proposed an efficient video Transformer that processes visual data with fewer parameters.
• Achieved superior performance on benchmark datasets, outperforming existing methods.
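The highlights do not describe how the audio-suppression mechanism works. The following is only a minimal sketch, assuming a scalar sigmoid gate conditioned on concatenated audio and video clip embeddings; the class name `AudioGate`, the feature dimensions, and the gating form are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AudioGate(nn.Module):
    """Learnable scalar gate that can suppress uninformative audio features.

    Hypothetical sketch: a gate value near 0 effectively ignores the audio
    modality for a given clip, letting the visual stream dominate.
    """
    def __init__(self, audio_dim: int, video_dim: int, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # gate value in [0, 1]
        )

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # Score how relevant the audio is, conditioned on both modalities,
        # then rescale the audio features accordingly.
        g = self.score(torch.cat([audio_feat, video_feat], dim=-1))
        return g * audio_feat


# Usage with assumed per-clip embedding sizes.
gate = AudioGate(audio_dim=512, video_dim=768)
audio = torch.randn(4, 512)        # batch of 4 audio embeddings
video = torch.randn(4, 768)        # batch of 4 video embeddings
gated_audio = gate(audio, video)   # audio contribution scaled per clip
print(gated_audio.shape)           # torch.Size([4, 512])
```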