Abstract: To address the challenge of effectively capturing features in contemporary video tasks, we propose an action recognition approach based on keyframe screening and feature fusion. Our method comprises two core modules. The keyframe screening module uses an attention mechanism to separate the input sequence of deep feature maps into two distinct tensors, reducing redundant spatial computation and improving the capture of key features. The spatio-temporal and action feature module contains two structurally different branches that extract spatio-temporal and action features, respectively, from the differentiated features produced by the first module. Through these tightly coupled modules, our approach identifies and extracts meaningful video features for the subsequent classification task. We build an end-to-end deep learning model with established frameworks, train and validate it on a generic video dataset, and confirm its efficacy through comparison and ablation experiments. Experiments on this dataset show that our model surpasses the majority of prior work.
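The keyframe screening idea described above can be illustrated with a minimal sketch: score each frame with a simple attention mechanism, then split the frame sequence into a "key" tensor and a "non-key" tensor. The scoring function, the top-k split, and all names here are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def screen_keyframes(features, k):
    """Toy attention-based keyframe screening (illustrative assumption).

    features: (T, C) array of per-frame deep feature vectors.
    Returns (key, non_key): the k frames with the highest attention
    scores and the remaining frames, as two separate tensors.
    """
    # Hypothetical attention: score each frame against the mean frame,
    # with the usual scaled dot-product and softmax normalisation.
    query = features.mean(axis=0)                          # (C,)
    scores = features @ query / np.sqrt(features.shape[1])  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over frames
    order = np.argsort(weights)[::-1]
    key_idx = np.sort(order[:k])                           # keep temporal order
    rest_idx = np.sort(order[k:])
    return features[key_idx], features[rest_idx]

# Usage: 8 frames of 16-dim features, keep 3 keyframes.
feats = np.random.default_rng(0).normal(size=(8, 16))
key, rest = screen_keyframes(feats, k=3)
print(key.shape, rest.shape)  # (3, 16) (5, 16)
```

In the paper's pipeline, the two resulting tensors would then be routed to the two branches of the spatio-temporal and action feature module; here the split simply demonstrates how an attention score can partition a frame sequence.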