Action Recognition in Videos using 3D ConvNets with Multi-head Attention

Yagmur Sahin, Mustafa Sert

Published: 2023, Last Modified: 21 Apr 2024BigMM 2023EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Deep learning continues to reshape the potential of video action recognition, offering a robust framework to capture the basics of dynamic human movements. Despite the progress, efficient recognition of actions in videos remains a challenge. In our study, we introduce a comprehensive deep-learning model that effectively addresses this challenge. Specifically, our model is based on the spatiotemporal 3D convolutional neural network architecture with a multi-head attention mechanism to distinguish complex actions across video frames. The multi-layered architecture, including multiple linear layers and a multi-head attention mechanism, ensures both depth and precision in action classification. This paper elaborates throughout our study from raw video inputs to recognized actions, emphasizing the transformative potential of deep learning in enhancing action recognition capabilities. Our results on two benchmark datasets, namely UCF and HMDB, demonstrate that our proposed architecture improves the 3D ConvNet performance.