Abstract: Video-text large language models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations about videos. However, without targeted training they perform nearly at chance on time-sensitive tasks such as temporal grounding, because they have not learned to use numbers to represent the start and end timestamps of video segments.
In this paper, we investigate verbal references, such as "at the beginning" or "at the end," as an alternative to timestamps for referring to video segments. We demonstrate that video-text LLMs, even those not trained on segment-level annotations, possess a substantial capability to perform temporal video grounding with the proposed verbal reference method.
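As a minimal sketch of the idea only: a coarse verbal reference can be mapped back to a span of the video. The three-way split, phrases, and function names below are assumptions for illustration, not the exact scheme used in the paper.

```python
# Hypothetical mapping from coarse verbal references to relative video spans.
VERBAL_TO_SPAN = {
    "at the beginning": (0.0, 1.0 / 3),
    "in the middle": (1.0 / 3, 2.0 / 3),
    "at the end": (2.0 / 3, 1.0),
}

def verbal_reference_to_segment(reference: str, video_len_sec: float):
    """Convert a coarse verbal reference into (start, end) timestamps in seconds."""
    lo, hi = VERBAL_TO_SPAN[reference.lower()]
    return lo * video_len_sec, hi * video_len_sec

# Example: grounding "at the end" in a 90-second video -> (60.0, 90.0)
print(verbal_reference_to_segment("at the end", 90.0))
```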
To further demonstrate its efficacy and robustness, we propose HawkEye, a video-text LLM that achieves not only state-of-the-art performance on zero-shot temporal grounding but also performance comparable to existing video-text LLMs across a spectrum of other video-text tasks. To train HawkEye, we construct InternVid-G, a large-scale video-text corpus with segment-level annotations for temporal grounding training. We also explore practical training techniques, such as mining grounding context spans from whole videos and data augmentation by randomly cropping videos.
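A rough sketch of the cropping augmentation mentioned above, with assumed details (the function and parameter names are hypothetical): a random crop window is sampled around the annotated segment, and the segment boundaries are remapped into the cropped clip.

```python
# Illustrative sketch of random-crop augmentation for grounding data: sample a
# crop window that still contains the annotated segment, then re-express the
# segment relative to the cropped clip. All times are in seconds.
import random

def random_crop_keeping_segment(video_len, seg_start, seg_end):
    """Return a crop window containing [seg_start, seg_end] and the segment's
    timestamps inside the cropped clip."""
    assert 0.0 <= seg_start < seg_end <= video_len
    crop_start = random.uniform(0.0, seg_start)    # crop begins before the segment
    crop_end = random.uniform(seg_end, video_len)  # crop ends after the segment
    new_start, new_end = seg_start - crop_start, seg_end - crop_start
    return (crop_start, crop_end), (new_start, new_end)

# Example: a 40-50s segment inside a 120s video becomes a new training clip
# with adjusted segment boundaries.
crop_window, segment_in_crop = random_crop_keeping_segment(120.0, 40.0, 50.0)
print(crop_window, segment_in_crop)
```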
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 952