Abstract: The most typical methods of human action recognition in videos rely on features extracted by deep neural network. Inspired by the temporal segment network, the sparse-temporal segment network to recognize human actions is proposed. Considering the sparse features contains the information of moving objects in videos, for example marginal information which is helpful to capture the target region and reduce the interference from similar actions, the robust principal component analysis algorithm was used to extract sparse features coping with background motion, illumination changes, noise and poor image quality. Based on different characteristics of three modal data, three parallel networks including RGB frame-network, optical flow-network and sparse feature-network were constructed and then fused through diverse ways. Comparative evaluations on the UCF101 demonstrate that three modal data contain the complementary features. Extensive experiments in subjective and objective show that temporal-sparse segment network can reach the accuracy of 94.2%, which is significantly better than several state-of-the-art algorithms.
0 Replies
Loading