Transformer-Based Human Action Recognition with Dynamic Feature Selection

Published: 01 Jan 2023, Last Modified: 06 Feb 2024 · CRV 2023
Abstract: Human action recognition in videos is an important computer vision task that aims to automatically recognize and classify human actions in video sequences. However, accurately recognizing human actions can be challenging due to the complexity and variability of human motion and appearance. In this paper, we propose ActiViT, a novel approach for human action recognition in videos based on a Transformer architecture. Unlike existing methods that rely on convolutional or recurrent layers, our model is built entirely on the Transformer encoder, enabling it to leverage valuable information in action image patch features. We demonstrate that by dynamically selecting key patches guided by specific human poses, our model learns informative features for distinguishing between different actions. Our experimental results on real-world datasets demonstrate the effectiveness of our model and the importance of selecting discriminative key poses for action recognition.
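The abstract describes two components: pose-guided selection of key image patches and a Transformer encoder that classifies actions from the selected patch tokens. The sketch below illustrates one plausible way to wire these pieces together; it is not the authors' released implementation. The module names (PoseGuidedPatchSelector, ActiViTSketch), the distance-to-keypoint scoring heuristic, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of pose-guided patch selection feeding a Transformer encoder.
# All names, dimensions, and the keypoint-distance scoring rule are assumptions
# for illustration, not the paper's actual architecture details.
import torch
import torch.nn as nn


class PoseGuidedPatchSelector(nn.Module):
    """Scores image patches by proximity to pose keypoints and keeps the top-k."""

    def __init__(self, num_keep: int):
        super().__init__()
        self.num_keep = num_keep

    def forward(self, patch_tokens, patch_centers, keypoints):
        # patch_tokens:  (B, N, D) patch embeddings
        # patch_centers: (N, 2) normalized (x, y) centers of the N patches
        # keypoints:     (B, K, 2) normalized pose keypoint coordinates
        centers = patch_centers.unsqueeze(0).expand(keypoints.size(0), -1, -1).contiguous()
        dists = torch.cdist(centers, keypoints)          # (B, N, K)
        scores = -dists.min(dim=-1).values               # closer to a joint -> higher score
        idx = scores.topk(self.num_keep, dim=-1).indices # (B, num_keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
        return patch_tokens.gather(1, idx)               # (B, num_keep, D)


class ActiViTSketch(nn.Module):
    """Transformer-encoder classifier over the selected patch tokens."""

    def __init__(self, dim=256, depth=4, heads=4, num_classes=60, num_keep=49):
        super().__init__()
        self.selector = PoseGuidedPatchSelector(num_keep)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens, patch_centers, keypoints):
        x = self.selector(patch_tokens, patch_centers, keypoints)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(x[:, 0])                        # classify from the class token


# Toy usage with random tensors: 8 clips, 196 patches of dim 256, 17 keypoints each.
if __name__ == "__main__":
    model = ActiViTSketch()
    tokens = torch.randn(8, 196, 256)
    centers = torch.rand(196, 2)
    joints = torch.rand(8, 17, 2)
    print(model(tokens, centers, joints).shape)          # torch.Size([8, 60])
```

The design choice illustrated here is that patch selection happens before the encoder, so attention is spent only on regions near the detected pose, which is one way to realize the "dynamically selecting key patches guided by specific human poses" idea described in the abstract.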