Enhancing Human Action Recognition with Fine-grained Body Movement Attention

Published: 01 Jan 2024 · Last Modified: 18 Sept 2025 · ICME 2024 · CC BY-SA 4.0
Abstract: In the field of vision-language models (VLMs), human action recognition models, while effective, typically rely on large pre-trained models or high-resolution inputs, leading to computational challenges. To address this, we propose a novel VLM approach with fine-grained attention to body movements. Unlike methods that rely on coarse video-text matching, we guide the model to infer actions from fine-grained body part movements using two techniques: fine-tuning pre-trained encoders at the fine-grained level and matching labels from language and vision perspectives at the coarse-grained level. Experiments show that our model performs strongly in fully-supervised, few-shot, and zero-shot scenarios with just 8 randomly sampled frames and a ViT-B/32 backbone. It outperforms most ViT-L/14-based models, demonstrating its effectiveness while saving computational resources. The largest Top-1 accuracy improvement over the second-best approach is 6.8%.
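The coarse-grained video-text matching baseline the abstract contrasts against can be sketched roughly as follows: sample a small number of random frames, encode them, mean-pool into a single video embedding, and score it against label (text) embeddings by cosine similarity. This is an illustrative sketch only; the random-projection "encoders", dimensions, and function names below are stand-ins and do not reflect the paper's actual ViT-B/32 or text encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frames(video, n_frames=8, rng=rng):
    # Randomly sample n_frames distinct frames from a (T, D) feature array.
    idx = np.sort(rng.choice(video.shape[0], size=n_frames, replace=False))
    return video[idx]

def encode(x, proj):
    # Stand-in encoder: linear projection followed by L2 normalization.
    z = x @ proj
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

T, D, E, C = 64, 128, 32, 5           # frames, frame dim, embed dim, num labels
video = rng.normal(size=(T, D))       # dummy per-frame features
labels = rng.normal(size=(C, D))      # dummy label (text) features
vis_proj = rng.normal(size=(D, E))    # hypothetical vision projection
txt_proj = rng.normal(size=(D, E))    # hypothetical text projection

frames = sample_frames(video)                     # (8, D)
video_emb = encode(frames, vis_proj).mean(axis=0) # mean-pool frame embeddings
video_emb /= np.linalg.norm(video_emb)
label_emb = encode(labels, txt_proj)              # (C, E)

scores = label_emb @ video_emb                    # cosine similarities per label
pred = int(np.argmax(scores))                     # predicted action label index
```

The paper's contribution is to go beyond this whole-video matching by additionally supervising the model with fine-grained body part movements; the sketch only shows the coarse-grained matching side.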