MMTS: Multimodal Teacher-Student learning for One-Shot Human Action Recognition

Published: 01 Jan 2023 · Last Modified: 14 Nov 2024 · BigComp 2023 · CC BY-SA 4.0
Abstract: Human action recognition (HAR) is used in many real-world applications, such as visual surveillance, video retrieval, and autonomous driving. It can draw on various modalities, including RGB, infrared, depth, and skeleton data. Among these, we use the skeleton modality, which suits real-time applications because it requires far less input data than RGB. Furthermore, we focus on the one-shot setting. Skeleton datasets tend to be smaller than those of other modalities, so it is hard to obtain the strong generalization needed to represent unseen data (i.e., novel classes). To address this problem, we propose a skeleton-text multimodal learning method that borrows a powerful text encoder pretrained on a large-scale dataset. Our method applies a teacher-student approach to a skeleton-text dataset and uses only the skeleton at inference time. The proposed method is better suited to one-shot skeleton-based HAR than existing multimodal learning methods. Our approach outperforms state-of-the-art methods under the one-shot action recognition protocol on the NTU RGB+D 120 dataset.
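To make the described pipeline concrete, below is a minimal PyTorch sketch of the teacher-student idea: a trainable skeleton encoder (student) is pulled toward embeddings from a frozen text teacher during training, and only the skeleton branch is used for one-shot inference. This is an illustrative sketch under assumptions, not the paper's exact design; the encoder architectures, embedding dimension, cosine-distance loss, and the stand-in lookup table replacing a real pretrained text encoder are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonEncoder(nn.Module):
    """Student: maps a skeleton sequence (B, T, J, 3) to a unit embedding."""
    def __init__(self, num_joints=25, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(start_dim=2),           # (B, T, J*3)
            nn.Linear(num_joints * 3, 256),
            nn.ReLU(),
        )
        self.head = nn.Linear(256, embed_dim)

    def forward(self, x):
        h = self.net(x).mean(dim=1)            # temporal average pooling
        return F.normalize(self.head(h), dim=-1)

class FrozenTextTeacher(nn.Module):
    """Teacher stand-in: in practice a large pretrained text encoder
    embedding each action-label description; here a frozen lookup
    table plays that role for illustration."""
    def __init__(self, num_classes=120, embed_dim=512):
        super().__init__()
        self.table = nn.Embedding(num_classes, embed_dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, labels):                 # labels: (B,)
        return F.normalize(self.table(labels), dim=-1)

def distill_step(student, teacher, skeletons, labels, optimizer):
    """Pull student skeleton embeddings toward the teacher's text
    embeddings via cosine distance; the text branch is training-only."""
    z_s = student(skeletons)
    with torch.no_grad():
        z_t = teacher(labels)
    loss = (1.0 - (z_s * z_t).sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def one_shot_classify(student, query, support_bank):
    """One-shot inference: nearest single exemplar per novel class,
    measured in the student's embedding space (skeleton input only)."""
    with torch.no_grad():
        q = student(query)                     # (1, D)
        sims = {c: (q @ student(s).T).item() for c, s in support_bank.items()}
    return max(sims, key=sims.get)
```

The key design point the sketch captures is that the text modality acts purely as a training signal: after distillation, the support exemplars and queries are both embedded with the skeleton encoder alone, so inference needs no text input.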