Enhancing Skeleton-Based Action Recognition With Language Descriptions From Pre-Trained Large Multimodal Models

Tian He, Yang Chen, Xu Gao, Ling Wang, Ting Hu, Hong Cheng

Published: 01 Mar 2025 · Last Modified: 23 Nov 2025 · IEEE Transactions on Circuits and Systems for Video Technology · CC BY-SA 4.0
Abstract: Skeleton data has become popular in human action recognition because of its efficacy in capturing human motion patterns while mitigating the influence of environmental noise. However, discarding critical action-related environmental cues makes it difficult to distinguish actions characterized by similar body movements. To address this limitation, we propose a novel framework that integrates skeleton data with language descriptions, capturing essential environmental information for fine-grained action recognition while retaining the robustness of skeleton-based methods. We first develop a Language Environment Description Generation (LEDG) module that utilizes the open-world understanding ability of Large Multimodal Models to generate instance-level, action-related language descriptions of the environment without training additional modules. Then, we introduce a Skeleton-supported Environment Feature Extraction (SEFE) module that leverages the temporal dependency inherent in skeleton data to extract key semantic environmental features. Additionally, we propose an Entropy-based Feature Fusion (EFF) module that dynamically fuses complementary features from the skeleton and language domains. Experimental results demonstrate the superiority of our framework: it improves the accuracy of existing skeleton-based action recognition methods and achieves state-of-the-art performance on four well-established skeleton-based action recognition benchmarks.
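To make the fusion idea concrete, below is a minimal PyTorch sketch of entropy-weighted late fusion in the spirit of the EFF module. It is an illustration under stated assumptions, not the paper's exact formulation: the function names (`entropy`, `entropy_weighted_fusion`), the use of softmax prediction entropy as a per-sample confidence signal, and the softmax-normalized weighting scheme are all assumptions introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the softmax distribution, one value per sample."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def entropy_weighted_fusion(skel_logits: torch.Tensor,
                            lang_logits: torch.Tensor) -> torch.Tensor:
    """Fuse skeleton- and language-branch logits, down-weighting the
    less certain branch (illustrative sketch, not the paper's EFF).

    A branch with higher predictive entropy (more uncertainty) receives
    a smaller fusion weight; the two weights sum to 1 per sample.
    """
    h_skel = entropy(skel_logits)                 # shape (B,)
    h_lang = entropy(lang_logits)                 # shape (B,)
    w = torch.stack([-h_skel, -h_lang], dim=-1)   # lower entropy -> larger weight
    w = F.softmax(w, dim=-1)                      # shape (B, 2), rows sum to 1
    return w[:, :1] * skel_logits + w[:, 1:] * lang_logits
```

Weighting by prediction entropy lets the combination adapt per instance: when the environment description is uninformative the skeleton branch dominates, and vice versa, which matches the abstract's description of dynamically amalgamating complementary features.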