Abstract: Deep learning continues to reshape the potential of video action recognition, offering a robust framework to capture the basics of dynamic human movements. Despite the progress, efficient recognition of actions in videos remains a challenge. In our study, we introduce a comprehensive deep-learning model that effectively addresses this challenge. Specifically, our model is based on the spatiotemporal 3D convolutional neural network architecture with a multi-head attention mechanism to distinguish complex actions across video frames. The multi-layered architecture, including multiple linear layers and a multi-head attention mechanism, ensures both depth and precision in action classification. This paper elaborates throughout our study from raw video inputs to recognized actions, emphasizing the transformative potential of deep learning in enhancing action recognition capabilities. Our results on two benchmark datasets, namely UCF and HMDB, demonstrate that our proposed architecture improves the 3D ConvNet performance.
Loading