Jointly modeling deep video and compositional text to bridge vision and language in a unified framework
Abstract: Joint video-language modeling has recently attracted increasing attention. However, most existing approaches explore the language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models videos and their corresponding text sentences. The framework consists of three parts: a compositional-semantics language model, a deep video model, and a joint embedding model. In the language model, we propose a dependency-tree structured model that embeds a sentence into a continuous vector space, preserving both visually grounded meaning and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, we minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update the two models jointly. Based on these three parts, our system can accomplish three tasks: 1) natural language generation, 2) video retrieval, and 3) language retrieval. Experimental results show that our approach outperforms SVM, CRF, and CCA baselines on Subject-Verb-Object triplet prediction and natural sentence generation, and outperforms CCA on the video retrieval and language retrieval tasks.
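To make the joint embedding idea concrete, here is a minimal sketch of the objective the abstract describes: a video encoder and a dependency-tree sentence encoder map both modalities into a shared space, and a distance loss in that space is backpropagated into both models so they are updated jointly. This is an illustrative reconstruction, not the authors' implementation; the layer sizes, the PyTorch framework, the specific composition function, and all names (VideoEncoder, TreeSentenceEncoder, etc.) are assumptions.

```python
# Illustrative sketch of the joint embedding objective (assumed details:
# pre-extracted video features, a toy dependency tree, squared-distance loss).
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Maps pre-extracted deep video features into the joint space."""
    def __init__(self, feat_dim=4096, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                 nn.Linear(512, embed_dim))

    def forward(self, video_feats):
        return self.net(video_feats)


class TreeSentenceEncoder(nn.Module):
    """Recursively composes word vectors along a dependency tree."""
    def __init__(self, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        # One composition function shared across all (head, child) pairs.
        self.compose = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                     nn.Tanh())

    def encode(self, node):
        # node = (word_id, [child_nodes]); children are folded into the head,
        # so the tree structure (and hence word order) shapes the embedding.
        word_id, children = node
        h = self.word_emb(torch.tensor(word_id))
        for child in children:
            c = self.encode(child)
            h = self.compose(torch.cat([h, c], dim=-1))
        return h


video_enc, sent_enc = VideoEncoder(), TreeSentenceEncoder()
params = list(video_enc.parameters()) + list(sent_enc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# Toy training pair: random features standing in for CNN video features,
# and the sentence "person rides horse" with "rides" as the dependency root.
video_feats = torch.randn(1, 4096)
tree = (1, [(0, []), (2, [])])   # rides(person, horse), toy word ids

v = video_enc(video_feats).squeeze(0)
s = sent_enc.encode(tree)
loss = ((v - s) ** 2).sum()      # distance between the two embeddings
loss.backward()                  # gradients flow into both encoders...
optimizer.step()                 # ...so the two models are updated jointly
```

Once trained, retrieval in either direction amounts to nearest-neighbor search in the joint space: embed a query video and rank candidate sentences by distance (or vice versa), while generation can condition on the video's position in that space.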