Abstract: Existing human joint representations do not fully exploit the learning power of Convolutional Neural Networks (CNNs). We propose a representation for skeleton joint sequences that is both spatial and spatio-temporal with respect to the receptive fields of CNN convolution kernels, facilitating learning from the spatial locations of the joints as well as their transitions over time. Our representation enables better hierarchical learning by CNNs: we transform skeleton sequences into images of flexible dimensions that encode rich spatial and spatio-temporal information about the joints by maximizing a unique distance metric, defined collaboratively over the distinct joint arrangements. The representation additionally encodes the relative joint velocities. The proposed action recognition method exploits the representation hierarchically, first capturing the micro-temporal relations between the skeleton joints with a CNN and then exploiting their macro-temporal relations by computing Fourier Temporal Pyramids. We extend the Inception-ResNet CNN architecture with the proposed method and improve the state-of-the-art accuracy by 4.4% on the large-scale NTU human activity dataset. On the NUCLA and UTD-MHAD datasets, our method outperforms existing results by 5.7% and 9.3%, respectively.
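To make the pipeline concrete, below is a minimal sketch of the three generic building blocks the abstract mentions: mapping a skeleton sequence to an image-like array, computing relative joint velocities, and summarizing a feature sequence with a Fourier Temporal Pyramid. The function names, the min-max scaling, and the pyramid parameters (`levels`, `keep`) are illustrative assumptions; the paper's actual distance-metric-driven joint arrangement is not reproduced here.

```python
import numpy as np

def skeleton_to_image(joints):
    """Map a skeleton sequence (T frames, J joints, 3 coords) to an
    image-like array: rows = joints, columns = time, coordinates as
    channels, values scaled to [0, 255]. (Illustrative encoding; the
    paper's joint ordering via a distance metric is not shown.)"""
    lo, hi = joints.min(), joints.max()
    img = (joints - lo) / (hi - lo + 1e-8) * 255.0
    return img.transpose(1, 0, 2)  # shape (J, T, 3)

def relative_velocities(joints):
    """Frame-to-frame joint displacements, zero-padded at the first
    frame so the output keeps the original length T."""
    vel = np.diff(joints, axis=0)              # (T-1, J, 3)
    return np.concatenate([np.zeros_like(joints[:1]), vel], axis=0)

def fourier_temporal_pyramid(features, levels=3, keep=4):
    """Fourier Temporal Pyramid over a feature sequence (T, D):
    split time into 1, 2, 4, ... segments per level, take the FFT of
    each segment along time, and keep the magnitudes of the first
    `keep` (low-frequency) coefficients."""
    T = features.shape[0]
    out = []
    for lvl in range(levels):
        n_seg = 2 ** lvl
        for s in range(n_seg):
            seg = features[s * T // n_seg:(s + 1) * T // n_seg]
            coeffs = np.fft.fft(seg, axis=0)[:keep]
            out.append(np.abs(coeffs).ravel())
    return np.concatenate(out)

# Example: a random sequence of 32 frames with 25 joints (as in NTU RGB+D).
seq = np.random.rand(32, 25, 3)
img = skeleton_to_image(seq)                          # (25, 32, 3)
vel = relative_velocities(seq)                        # (32, 25, 3)
ftp = fourier_temporal_pyramid(seq.reshape(32, -1))   # fixed-length vector
```

In the paper's hierarchy, the CNN would consume image encodings like `img` to capture micro-temporal structure, while a pyramid like `ftp` summarizes macro-temporal structure into a fixed-length descriptor regardless of sequence length.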