Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Video-Language models, LLM, Video Understanding, Zero-shot
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Vision-language models (VLMs) classify a query video by computing a similarity score between its visual features and text-based class label representations.
Recently, large language models (LLMs) have been used to enrich the text-based
class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier, and the query visual features are not considered. In this paper, we propose a framework that combines pre-trained discriminative VLMs with pre-trained generative video-to-text
and text-to-text models. We introduce two key modifications to the standard zero-shot setting. First, we propose language-guided visual feature enhancement and
employ a video-to-text model to convert the query video to its descriptive form.
The resulting descriptions contain vital visual cues of the query video, such as
what objects are present and their spatio-temporal interactions. These descriptive cues provide VLMs with additional semantic knowledge that enhances their zero-shot performance. Second, we propose video-specific prompts for LLMs to generate more meaningful descriptions that enrich the class label representations. Specifically, we introduce prompting techniques to create a Tree Hierarchy of Categories for class names, offering higher-level action context as additional visual cues. We
demonstrate the effectiveness of our approach in video understanding across three
different zero-shot settings: 1) video action recognition, 2) video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks. Consistent improvements
across multiple benchmarks and with various VLMs confirm the effectiveness of our proposed framework. Our code will be made publicly available.
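For concreteness, below is a minimal sketch of the similarity-based zero-shot scoring the abstract describes, with LLM-enriched class descriptions and a video-to-text caption fused into the query representation. Everything here is a hypothetical stand-in: the `embed` encoder, the class descriptions (including the hierarchy-style prefixes), the caption, and the averaging fusion rule are illustrative assumptions, not the paper's actual models, prompts, or fusion method.

```python
import numpy as np

def embed(text: str, dim: int = 512) -> np.ndarray:
    # Stand-in for a real VLM encoder (e.g., a CLIP-style text/video tower).
    # Returns a pseudo-embedding (stable within one run) so the sketch runs end to end.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Hypothetical LLM-enriched class descriptions; the leading "a sport, a target
# sport" prefixes loosely mimic a tree-hierarchy context, purely for illustration.
class_descriptions = {
    "archery": "a sport, a target sport: a person draws a bow and releases an arrow at a target",
    "bowling": "a sport, a ball sport: a person rolls a heavy ball down a lane toward pins",
}

# Hypothetical output of a video-to-text model for the query video.
video_caption = "someone aims a bow and fires an arrow at a distant target"

# Fuse the raw visual features with the caption embedding; a simple average is
# assumed here -- the paper's actual fusion rule may differ.
visual_feat = embed("query video")          # stand-in for the VLM's video embedding
query = visual_feat + embed(video_caption)
query /= np.linalg.norm(query)

# Zero-shot classification: cosine similarity against enriched class embeddings.
scores = {name: float(query @ embed(desc)) for name, desc in class_descriptions.items()}
print(scores, "->", max(scores, key=scores.get))
```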
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5632