Keywords: Video captioning, transformer, cross-modality
Abstract: As the most critical components of a sentence, the subject, predicate, and object require special attention in video captioning. In this paper, we design collaborative three-stream transformers to model the interactions of objects and the actions/relations of objects across different modalities. Specifically, the model is formed by three transformer branches that exploit visual-linguistic interactions of different granularities in the spatio-temporal domain: between videos and text, between detected objects and text, and between actions and text. Meanwhile, we design a cross-modality attention module to align the interactions modeled by the three branches; that is, an affinity matrix is computed to align the visual modalities by injecting information from the other interactions. In this way, the three branches support each other in exploiting the most discriminative semantic information in each modality for accurate caption prediction, especially for the subject, predicate, and object parts of a sentence. The whole model is trained end-to-end. Extensive experiments on two large-scale challenging datasets, i.e., YouCookII and ActivityNet Captions, demonstrate that the proposed method performs favorably against state-of-the-art methods.
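The abstract's cross-modality attention can be pictured as computing an affinity matrix between the token sequences of two branches and using it to inject aligned features from one branch into the other. The following is a minimal sketch of that idea only; the class and parameter names (CrossModalityAttention, query_proj, etc.) and the residual-injection form are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class CrossModalityAttention(nn.Module):
    """Hypothetical sketch: align two branches via an affinity matrix."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # projects target-branch tokens
        self.key_proj = nn.Linear(dim, dim)    # projects source-branch tokens
        self.value_proj = nn.Linear(dim, dim)  # features carried over from source
        self.scale = dim ** -0.5

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, N, D) tokens of the branch being refined
        # source: (B, M, D) tokens of the branch providing context
        q = self.query_proj(target)
        k = self.key_proj(source)
        v = self.value_proj(source)
        # Affinity matrix between the two token sequences: (B, N, M)
        affinity = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Inject aligned source information back into the target branch
        return target + affinity @ v


# Usage sketch: refine video-text tokens with object-text context
align = CrossModalityAttention(dim=512)
video_text = torch.randn(2, 64, 512)   # hypothetical video-text branch tokens
object_text = torch.randn(2, 32, 512)  # hypothetical object-text branch tokens
refined = align(video_text, object_text)  # (2, 64, 512)
```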