A Convolutional Temporal Encoder for Video Caption Generation

Qingle Huang, Zicheng Liao

15 Feb 2020, OpenReview Archive Direct Upload, Readers: Everyone

Abstract: We propose a convolutional temporal encoding network for video sequence embedding and caption generation. Mainstream video captioning work is based on recurrent encoders of various forms (e.g. LSTMs and hierarchical encoders). In this work, we propose a multi-layer convolutional neural network encoder instead. At the core of this encoder is a gated linear unit (GLU), which applies a linear convolutional transformation to its input and modulates the result with a nonlinear gate, and which has demonstrated superior performance in natural language modeling. Our model builds on this unit for video encoding and integrates several recent techniques, including batch normalization, skip connections and soft attention. Experiments on two large-scale benchmark datasets (MSVD and M-VAD) yield strong results and demonstrate the effectiveness of our model.
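
As a rough illustration of the gated linear unit mentioned in the abstract, the following is a minimal NumPy sketch of a single GLU convolution applied along the time axis of a sequence of frame features. It assumes the common GLU form, a linear convolution multiplied elementwise by a sigmoid-gated convolution. The function name `glu_temporal_conv`, the kernel width, the zero padding and all shapes are hypothetical choices for illustration; the abstract does not specify them, and the sketch omits the batch normalization, skip connections and soft attention that the full model uses.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def glu_temporal_conv(frames, w_lin, b_lin, w_gate, b_gate):
    """Apply one gated linear unit over the time axis (illustrative only).

    frames : (T, D) array, one D-dimensional feature vector per frame.
    w_lin, w_gate : (k, D, H) convolution kernels with odd width k.
    b_lin, b_gate : (H,) biases.
    Returns a (T, H) array: (frames * w_lin + b_lin) gated by
    sigmoid(frames * w_gate + b_gate), where * is a temporal convolution.
    """
    k = w_lin.shape[0]
    assert k % 2 == 1, "this sketch assumes an odd kernel width"
    pad = k // 2
    # Zero-pad in time so the output keeps the same length T (an assumption).
    padded = np.pad(frames, ((pad, pad), (0, 0)))
    num_frames = frames.shape[0]
    out = np.empty((num_frames, w_lin.shape[2]))
    for t in range(num_frames):
        window = padded[t:t + k]  # (k, D) slice centred on frame t
        linear = np.tensordot(window, w_lin, axes=([0, 1], [0, 1])) + b_lin
        gate = sigmoid(np.tensordot(window, w_gate, axes=([0, 1], [0, 1])) + b_gate)
        out[t] = linear * gate  # the gate scales each linear output channel
    return out


# Hypothetical sizes: 8 frames, 16-dim features, 4 output channels, width 3.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))
w_lin = 0.1 * rng.standard_normal((3, 16, 4))
w_gate = 0.1 * rng.standard_normal((3, 16, 4))
b_lin = np.zeros(4)
b_gate = np.zeros(4)

encoded = glu_temporal_conv(frames, w_lin, b_lin, w_gate, b_gate)
print(encoded.shape)  # (8, 4)
```

Stacking several such layers lets each output position see a progressively wider span of frames without any recurrence, which is the general motivation for a convolutional encoder over a recurrent one.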
