Abstract: With the rapid growth in the number of videos, human-centric visual understanding tasks, especially human action recognition (HAR), are in great demand. The key issue in HAR is to capture collaborative spatial-temporal representations by mining abstract and discriminative high-level features. In this article, we propose a cross-modal learning framework, consisting mainly of an alignment net and a fusion net, to improve HAR performance. First, the information extracted from the different modalities is mapped into a common subspace for alignment, which compensates for spatial-temporal discrepancies. Then, the aligned features are fused to generate complementary, correlated, and consistent representations. Finally, the learned features are fed to a classifier for recognition. Experimental results show that the proposed approach can outperform several state-of-the-art baseline approaches.
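The three stages named in the abstract (alignment into a common subspace, fusion, classification) can be illustrated with a minimal forward-pass sketch. This is not the authors' architecture: the two modalities, all dimensions, the linear-plus-tanh projections, the concatenation-based fusion, and the mean-squared alignment penalty are assumptions chosen only to make the data flow concrete, and the weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Return randomly initialised (untrained) weights and a zero bias."""
    return rng.standard_normal((in_dim, out_dim)) * 0.01, np.zeros(out_dim)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical sizes: two modalities "a" and "b" with different feature widths.
BATCH, DIM_A, DIM_B, DIM_COMMON, N_CLASSES = 4, 512, 256, 128, 10

W_a, b_a = linear(DIM_A, DIM_COMMON)           # alignment projection, modality a
W_b, b_b = linear(DIM_B, DIM_COMMON)           # alignment projection, modality b
W_f, b_f = linear(2 * DIM_COMMON, DIM_COMMON)  # fusion layer (assumed form)
W_c, b_c = linear(DIM_COMMON, N_CLASSES)       # classifier

def forward(feat_a, feat_b):
    # 1) Alignment: map each modality into the same common subspace.
    z_a = np.tanh(feat_a @ W_a + b_a)
    z_b = np.tanh(feat_b @ W_b + b_b)
    # Assumed alignment penalty: mean squared distance between the projections.
    align_loss = np.mean((z_a - z_b) ** 2)
    # 2) Fusion: combine the aligned features into one representation.
    fused = np.tanh(np.concatenate([z_a, z_b], axis=1) @ W_f + b_f)
    # 3) Classification: per-sample probabilities over action classes.
    probs = softmax(fused @ W_c + b_c)
    return probs, align_loss

# Stand-in features; in practice these would come from per-modality extractors.
feat_a = rng.standard_normal((BATCH, DIM_A))
feat_b = rng.standard_normal((BATCH, DIM_B))
probs, align_loss = forward(feat_a, feat_b)
```

In a trained system, a penalty such as `align_loss` would typically be minimized jointly with a classification loss so that the projected modalities agree before they are fused; the abstract does not specify which objective the authors use.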
External IDs: dblp:journals/sj/ZhengZ21