Adaptive Feature Abstraction for Translating Video to Language

Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin

Nov 04, 2016 (modified: Mar 05, 2017) · ICLR 2017 conference submission · Readers: everyone
  • Abstract: Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as the video representation, preventing them from modeling the rich, varying, context-dependent semantics in video descriptions. In this paper, we propose a new approach to generating adaptive spatiotemporal representations of videos for the captioning task. For this purpose, novel attention mechanisms with spatiotemporal alignment are employed to adaptively and sequentially focus on different layers of CNN features (levels of feature "abstraction"), as well as on local spatiotemporal regions of the feature maps at each layer (see the illustrative sketch after this list). Our approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizations of the results and of how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.
  • Keywords: Computer vision, Deep learning
  • Conflicts: duke.edu, nec-labs.com, virginia.edu
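
The abstract describes attending over different CNN layers (levels of feature abstraction) conditioned on the decoder state. Below is a minimal, hedged sketch of that general idea, not the authors' actual model or released code: a single attention module that weights pooled features from several CNN layers given a decoder hidden state. All names (`LayerAttention`, `feat_dims`, `attn_dim`) and the choice of PyTorch are assumptions for illustration; the paper's full mechanism also attends over local spatiotemporal regions within each layer, which is omitted here for brevity.

```python
# Illustrative sketch only -- not the authors' method. Assumes spatially pooled
# features from several CNN layers and a recurrent decoder hidden state.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerAttention(nn.Module):
    """Attend over features drawn from multiple CNN layers (abstraction levels)."""

    def __init__(self, feat_dims, hidden_dim, attn_dim=256):
        super().__init__()
        # Project each layer's pooled features into a shared attention space.
        self.proj = nn.ModuleList(nn.Linear(d, attn_dim) for d in feat_dims)
        self.query = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, layer_feats, h):
        # layer_feats: list of tensors, one per CNN layer, each (batch, feat_dims[l])
        # h: decoder hidden state, (batch, hidden_dim)
        keys = torch.stack([p(f) for p, f in zip(self.proj, layer_feats)], dim=1)  # (B, L, A)
        q = self.query(h).unsqueeze(1)                                             # (B, 1, A)
        scores = self.score(torch.tanh(keys + q)).squeeze(-1)                      # (B, L)
        alpha = F.softmax(scores, dim=-1)            # attention weights over layers
        context = (alpha.unsqueeze(-1) * keys).sum(dim=1)                          # (B, A)
        return context, alpha


# Usage: three layers of pooled CNN features with different channel widths.
feats = [torch.randn(2, 256), torch.randn(2, 512), torch.randn(2, 2048)]
attn = LayerAttention(feat_dims=[256, 512, 2048], hidden_dim=512)
ctx, alpha = attn(feats, torch.randn(2, 512))
print(ctx.shape, alpha.shape)  # torch.Size([2, 256]) torch.Size([2, 3])
```

In this simplified sketch the context vector lives in the shared projection space rather than any single layer's feature space; the weights `alpha` are recomputed at every decoding step, which is what makes the chosen level of abstraction adaptive to the word being generated.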
