Does Your Video-language Model Actually Understand the Language Input?

Xiang Fang; Wanlong Fang; Changshuo Wang; Daizong Liu; Xiaoye Qu

19 Sept 2024 (modified: 23 Jan 2025) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0

Keywords: Video-language Model, Coarse-grained Language Alignment, Attribute-based Text Reasoning, Fine-grained Language Alignment

Abstract: Driven by the wave of Large Language Models (LLMs), Video-Language Models (VLMs) have become a significant yet challenging technology for bridging the gap between video and text. Although previous VLM works have made significant progress, almost all of them implicitly assume that the input texts follow a predefined template. In real-world applications this assumption cannot be satisfied, since predefining all texts is extremely time-consuming and labor-intensive. Moreover, such predefined text inputs are too strict and user-unfriendly, which limits their applications. We observe that, for the same video input, texts with similar semantics can lead to different performance. To address this, we propose a novel text-augmented VLM method that improves video-text fusion through text rewriting. Specifically, we first use a pre-trained LLM to generate varied text samples from the original ones, each targeting specific text components. A multi-level contrastive learning module is designed to mine coarse-grained language information. We also propose an attribute-based text reasoning strategy to learn fine-grained textual semantics. Extensive experiments on many video-language tasks show that the proposed method can serve as a plug-and-play module that effectively improves the performance of state-of-the-art VLMs.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 1695
