Abstract: Human Activity Recognition (HAR) is a domain of increasing interest, and several two-stream architectures have been proposed in recent years. However, such models have a large number of parameters and high storage requirements because they include a dedicated temporal stream. In this paper, we propose an architecture based on the weighted late fusion of the softmax scores of a spatiotemporal stream (I3D) and a 2D convolutional neural network stream (Xception). We show that our model achieves competitive performance with respect to existing spatial and two-stream architectures while significantly reducing the number of parameters and minimizing storage costs.
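Weighted late fusion combines the per-class probabilities of the two streams only at the output, after each network has produced its own softmax scores. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a convex combination p = w * p_I3D + (1 - w) * p_Xception with a single scalar weight w in [0, 1]. The abstract does not give the exact fusion rule or weight. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_late_fusion(scores_a, scores_b, w=0.5):
    """Hypothetical fusion rule: convex combination of two softmax score vectors.

    Assumption: one scalar weight w for stream A (e.g. I3D) and 1 - w for
    stream B (e.g. Xception); the paper's exact weighting is not stated here.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(scores_a) != len(scores_b):
        raise ValueError("both streams must score the same set of classes")
    return [w * a + (1.0 - w) * b for a, b in zip(scores_a, scores_b)]


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical logits for three activity classes from each stream.
    i3d_probs = softmax([2.0, 1.0, 0.1])
    xception_probs = softmax([0.5, 2.5, 0.2])

    for w in (0.7, 0.3):
        fused = weighted_late_fusion(i3d_probs, xception_probs, w=w)
        print(w, [round(p, 3) for p in fused], predict(fused))
```

Because each input vector is a probability distribution and the weights sum to one, the fused scores also sum to one. Only the scalar w needs tuning, for example on a validation split. In this toy example, shifting w toward one stream changes the predicted class.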