Text-Conditional Visual-Language Alignment for Video Captioning

Wenhui Jiang, Wenbin Guan, Haijun Li, Zhizhen Li, Yuming Fang, Yuxin Peng, Xiaowei Zhao, Yang Liu

Published: 01 Jan 2025, Last Modified: 09 Nov 2025. Venue: IEEE Transactions on Circuits and Systems for Video Technology. License: CC BY-SA 4.0
Abstract: Video captioning remains a challenging task due to diverse video content and the complex relationships between visual and textual elements. Recent efforts predominantly focus on multimodal architecture designs trained with paired video-caption data. Nonetheless, this learning paradigm suffers from the “one-to-many” correspondence problem, since one source video is mapped to multiple caption annotations. The difficulty of video captioning is further exacerbated by poorly written captions, which mislead the captioner with irrelevant information. Essentially, the problem stems from inadequate alignment between video and caption. In this work, we propose a Text-Conditional Alignment Transformer, which fully exploits the rich information provided by diverse labeled captions and avoids the impact of label ambiguity and noise. To alleviate the challenge of the “one-to-many” correspondence, we introduce Text-conditioned Video Encoding, which diversifies the video representation by emphasizing the spatial-temporal visual areas relevant to the given description while filtering out redundant visual information. The refined video representation is well aligned with the corresponding text description, and naturally converts the “one-to-many” mapping into a “one-to-one” mapping. To deal with noisy annotations, we propose Quality-aware Caption Decoding. We first dynamically measure the quality of the different captions corresponding to the same video in a reference-free manner. The estimated qualities are then used as auxiliary signals, guiding the model to perform quality-aligned learning from noisy captions. We conduct extensive experiments on the MSR-VTT, MSVD, VATEX and ActivityNet-Entities datasets, and demonstrate consistent performance improvements over state-of-the-art methods.
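The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dot-product similarity, the softmax pooling, and the `temperature` parameter are all assumptions made for illustration. In particular, the abstract does not say how the estimated caption qualities enter training, so the loss re-weighting below is a simple stand-in for "quality-aligned learning".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_conditioned_pool(video_tokens, text_embedding, temperature=1.0):
    """Pool video tokens with weights given by similarity to one caption.

    Hypothetical simplification: each caption yields its own video
    representation, so one video paired with several captions becomes
    several (video representation, caption) pairs.
    """
    scores = [dot(tok, text_embedding) / temperature for tok in video_tokens]
    weights = softmax(scores)
    dim = len(video_tokens[0])
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, video_tokens))
        for d in range(dim)
    ]
    return pooled, weights


def quality_weighted_loss(per_caption_losses, quality_scores):
    """Combine per-caption losses, down-weighting low-quality captions.

    Assumption: quality scores are turned into weights with a softmax.
    """
    weights = softmax(quality_scores)
    return sum(w * loss for w, loss in zip(weights, per_caption_losses))


# Usage: two video tokens, and a caption embedding aligned with the first.
tokens = [[1.0, 0.0], [0.0, 1.0]]
caption = [2.0, 0.0]
pooled, attn = text_conditioned_pool(tokens, caption)
loss = quality_weighted_loss([1.0, 3.0], [5.0, 0.0])
```

In this toy setting the first token receives the larger weight because it is more similar to the caption embedding, and the caption with the higher assumed quality score dominates the combined loss.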