Abstract: Video-text large language models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations about videos. However, without targeted training they perform nearly at chance on time-sensitive tasks such as temporal grounding, because they have not learned to use numbers to represent the start and end timestamps of video segments.
In this paper, we investigate verbal references, such as "at the beginning" or "at the end," as an alternative to timestamps for referring to video segments. We demonstrate that video-text LLMs, even those not trained on segment-level annotations, possess a substantial capability to perform temporal video grounding with the proposed verbal reference method.
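As a minimal sketch of the idea only: a coarse verbal reference can be mapped back to a span of the video. The three-way split, phrases, and function names below are assumptions for illustration, not the exact scheme used in the paper.

```python
# Hypothetical mapping from coarse verbal references to relative video spans.
VERBAL_TO_SPAN = {
    "at the beginning": (0.0, 1.0 / 3),
    "in the middle": (1.0 / 3, 2.0 / 3),
    "at the end": (2.0 / 3, 1.0),
}

def verbal_reference_to_segment(reference: str, video_len_sec: float):
    """Convert a coarse verbal reference into (start, end) timestamps in seconds."""
    lo, hi = VERBAL_TO_SPAN[reference.lower()]
    return lo * video_len_sec, hi * video_len_sec

# Example: grounding "at the end" in a 90-second video -> (60.0, 90.0)
print(verbal_reference_to_segment("at the end", 90.0))
```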
To further demonstrate its efficacy and robustness, we propose HawkEye, a video-text LLM that achieves not only state-of-the-art performance on zero-shot temporal grounding but also performance comparable to existing video-text LLMs across a spectrum of other video-text tasks. To train HawkEye, we construct InternVid-G, a large-scale video-text corpus with segment-level annotations for temporal grounding training. We also explore practical training techniques, such as mining grounding context spans from whole videos and data augmentation by randomly cropping videos.
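A rough sketch of the cropping augmentation mentioned above, with assumed details (the function and parameter names are hypothetical): a random crop window is sampled around the annotated segment, and the segment boundaries are remapped into the cropped clip.

```python
# Illustrative sketch of random-crop augmentation for grounding data: sample a
# crop window that still contains the annotated segment, then re-express the
# segment relative to the cropped clip. All times are in seconds.
import random

def random_crop_keeping_segment(video_len, seg_start, seg_end):
    """Return a crop window containing [seg_start, seg_end] and the segment's
    timestamps inside the cropped clip."""
    assert 0.0 <= seg_start < seg_end <= video_len
    crop_start = random.uniform(0.0, seg_start)    # crop begins before the segment
    crop_end = random.uniform(seg_end, video_len)  # crop ends after the segment
    new_start, new_end = seg_start - crop_start, seg_end - crop_start
    return (crop_start, crop_end), (new_start, new_end)

# Example: a 40-50s segment inside a 120s video becomes a new training clip
# with adjusted segment boundaries.
crop_window, segment_in_crop = random_crop_keeping_segment(120.0, 40.0, 50.0)
print(crop_window, segment_in_crop)
```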
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 952