Adaptive Feature Abstraction for Translating Video to Language
Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin
Feb 14, 2017 (modified: Mar 17, 2017) | ICLR 2017 workshop submission | readers: everyone
Abstract: A new model for video captioning is developed, using a deep three-dimensional Convolutional Neural Network (C3D) as an encoder for videos and a Recurrent Neural Network (RNN) as a decoder for captions. A novel attention mechanism with spatiotemporal alignment is employed to adaptively and sequentially focus on different layers of CNN features (levels of feature "abstraction"), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on the YouTube2Text benchmark. Experimental results demonstrate quantitatively the effectiveness of our proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantic structures.
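The two-level attention described in the abstract (over local spatiotemporal regions within each layer, then over layers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the feature maps from all layers have already been mapped to a common shape of N locations by d channels, and it uses a simple bilinear score against the decoder state, which is a swapped-in choice for illustration. The names `two_level_attention`, `W_loc`, and `W_layer` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_attention(feats, h, W_loc, W_layer):
    """Illustrative attention over locations, then over layers.

    feats:   (L, N, d) features from L layers, N spatiotemporal locations each
             (assumed already aligned to a common shape)
    h:       (k,) decoder (RNN) hidden state at the current step
    W_loc:   (d, k) hypothetical projection for location scores
    W_layer: (d, k) hypothetical projection for layer scores
    """
    # Location attention: one distribution over N locations per layer.
    q_loc = W_loc @ h                                   # (d,)
    alpha = softmax(feats @ q_loc, axis=1)              # (L, N)
    layer_ctx = np.einsum("ln,lnd->ld", alpha, feats)   # (L, d)

    # Layer attention: one distribution over the L abstraction levels.
    q_layer = W_layer @ h                               # (d,)
    beta = softmax(layer_ctx @ q_layer, axis=0)         # (L,)
    context = beta @ layer_ctx                          # (d,)
    return context, alpha, beta

# Toy usage with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
L, N, d, k = 3, 8, 4, 5
feats = rng.normal(size=(L, N, d))
h = rng.normal(size=k)
W_loc = rng.normal(size=(d, k))
W_layer = rng.normal(size=(d, k))
context, alpha, beta = two_level_attention(feats, h, W_loc, W_layer)
```

In a captioning decoder, a context vector like this would be recomputed at every word, so the weights over layers and locations can shift as the sentence is generated.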
Conflicts: duke.edu, nec-labs.com, virginia.edu