Adaptive Feature Abstraction for Translating Video to Language
Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin
Nov 04, 2016 (modified: Mar 05, 2017); ICLR 2017 conference submission; readers: everyone
Abstract: Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as the video representation, which prevents them from modeling the rich, varying, context-dependent semantics in video descriptions. In this paper, we propose a new approach to generating adaptive spatiotemporal representations of videos for a captioning task. For this purpose, novel attention mechanisms with spatiotemporal alignment are employed to adaptively and sequentially focus on different layers of CNN features (levels of feature "abstraction"), as well as on local spatiotemporal regions of the feature maps at each layer. Our approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD, and MSR-VTT. Along with visualizations of the results and of how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.
Keywords: Computer vision, Deep learning
Conflicts: duke.edu, nec-labs.com, virginia.edu
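
The abstract describes attention applied at two levels: over local spatiotemporal regions within each CNN layer, and over the layers themselves. The sketch below is one minimal way to illustrate that idea, not the authors' implementation. It assumes that features from every layer have already been projected to a common dimension, that the decoder state is available as a query vector, and that plain dot-product scoring is used; the names `attend` and `adaptive_context` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, vectors):
    """Soft attention: weight each vector by softmax(query . vector).

    Dot-product scoring is an assumption made for this sketch.
    """
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return context, weights


def adaptive_context(query, layers):
    """Two-level attention (hypothetical helper).

    layers: one entry per CNN layer; each entry is a list of feature
    vectors, one per spatiotemporal location, all of the same dimension.
    """
    # Level 1: attend over locations inside each layer.
    per_layer = [attend(query, locations)[0] for locations in layers]
    # Level 2: attend over the resulting per-layer summaries.
    context, layer_weights = attend(query, per_layer)
    return context, layer_weights


# Toy example: two layers with 2 and 3 locations, feature dimension 2.
query = [1.0, 0.0]
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
]
context, layer_weights = adaptive_context(query, layers)
```

In this toy input the first layer contains a location that aligns with the query, so it receives the larger layer weight. In a captioning model the query would change at every decoding step, which is what would make the selected level of abstraction adaptive.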