Video Understanding via Convolutional Temporal Pooling Network and Multimodal Feature Fusion

Heeseung Kwon, Suha Kwak, Minsu Cho

2018 (modified: 18 Nov 2022), CoVieW@MM 2018, Readers: Everyone

Abstract: In this paper, we present a new end-to-end convolutional neural network architecture for video classification and apply the model to action and scene recognition in untrimmed videos for the Challenge on Comprehensive Video Understanding in the Wild. The proposed architecture takes densely sampled video frames as input and applies a temporal pooling operator inside the network to capture the temporal context of the input video. By using a set of different temporal pooling operators, the architecture outputs a distinct video-level feature for each operator. Furthermore, we design a multimodal feature fusion model by concatenating our video-level features with those provided in the challenge dataset. Experimental results on the challenge dataset demonstrate that the proposed architecture and the multimodal feature fusion approach together achieve strong performance in action and scene recognition.
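The two ideas in the abstract, temporal pooling over frames and fusion by concatenation, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The abstract does not name the pooling operators, so average and max pooling are assumed here. The sketch also pools precomputed per-frame feature vectors, whereas the paper applies pooling inside the network. All function names and the toy values are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Assumed operator set; the abstract only says "a set of different
# temporal pooling operators" without listing them.
POOLING_OPS: Dict[str, Callable[[Sequence[float]], float]] = {
    "avg": mean,
    "max": max,
}


def temporal_pool(
    frame_features: Sequence[Sequence[float]],
    op: Callable[[Sequence[float]], float],
) -> Vector:
    """Collapse the time axis by applying `op` to each feature dimension."""
    if not frame_features:
        raise ValueError("need at least one frame")
    # zip(*...) regroups values by feature dimension across all frames.
    return [op(column) for column in zip(*frame_features)]


def video_level_features(
    frame_features: Sequence[Sequence[float]],
) -> Dict[str, Vector]:
    """Produce one video-level feature vector per pooling operator."""
    return {
        name: temporal_pool(frame_features, op)
        for name, op in POOLING_OPS.items()
    }


def fuse(*feature_vectors: Sequence[float]) -> Vector:
    """Multimodal fusion by simple concatenation."""
    fused: Vector = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


# Toy example: 3 frames, each with a 2-dimensional feature vector.
frames = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
pooled = video_level_features(frames)

# Stand-in for a feature supplied with the dataset (made-up values).
provided = [0.5, 0.25, 0.125]
fused = fuse(pooled["avg"], pooled["max"], provided)
```

In this sketch the fused vector has one block per pooling operator followed by the dataset-provided block, so a downstream classifier sees all of them side by side.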
