Abstract: Video action recognition has long been an active research direction in computer vision, with most existing methods focusing on coarse-grained macro-action recognition. Fine-grained action recognition, however, remains challenging. Micro-actions, which are highly fine-grained, low in intensity, and brief, are crucial for applications such as emotion recognition and psychological assessment. In this paper, we use popular video action recognition frameworks as foundation models and introduce multiple auxiliary heads and hybrid loss optimization to advance micro-action recognition. Specifically, the Frame-Level Prediction and Coarse-Grained Body-Action auxiliary heads work together to help the model and its Fine-Grained Micro-Action primary head perceive fine-grained details and capture keyframes. Incorporating F1 loss, ArcFace loss, and a weighted multi-task loss improves training stability, convergence speed, and performance. Additionally, we integrate the optical flow modality to enrich model diversity and apply ensemble learning across all foundation models. Our method achieves a 75.37% F1-mean on the MA-52 dataset, ranking 1st in the Micro-Action Analysis Grand Challenge held in conjunction with ACM MM'24. The code is available at https://github.com/qklee-lz/ACMMM2024-MAC.
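To make the hybrid loss idea concrete, the following is a minimal sketch of a soft (differentiable-style) macro-F1 loss and a weighted sum of per-head losses. It is an illustration under stated assumptions, not the formulation used in the released code: the function names `soft_f1_loss` and `weighted_multitask_loss`, the use of soft true/false positive counts, and the example weights are all hypothetical.

```python
from typing import List, Sequence


def soft_f1_loss(probs: List[List[float]], labels: List[int], eps: float = 1e-8) -> float:
    """Return 1 - macro-averaged soft F1 (assumed formulation, not the paper's).

    probs  : per-sample class probabilities, shape [num_samples][num_classes]
    labels : ground-truth class index for each sample
    """
    num_classes = len(probs[0])
    f1_sum = 0.0
    for c in range(num_classes):
        # Soft counts: probabilities stand in for hard 0/1 predictions.
        tp = sum(p[c] for p, y in zip(probs, labels) if y == c)
        fp = sum(p[c] for p, y in zip(probs, labels) if y != c)
        fn = sum(1.0 - p[c] for p, y in zip(probs, labels) if y == c)
        f1_sum += 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1_sum / num_classes


def weighted_multitask_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-head losses (weights here are hypothetical)."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, losses))


# Toy usage: losses for the micro-action (primary), body-action, and
# frame-level heads, combined with illustrative weights.
perfect = soft_f1_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1])
wrong = soft_f1_loss([[0.0, 1.0], [1.0, 0.0]], [0, 1])
total = weighted_multitask_loss([1.0, 2.0, 3.0], [1.0, 0.5, 0.5])
```

In this sketch the primary head receives the largest weight so that the auxiliary heads guide rather than dominate training; the actual weighting scheme would need to be taken from the paper or its repository.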