TCVP: A Practical Pipeline for Video Moment Retrieval Datasets Using Timestamped Video Comments

ACL ARR 2026 January Submission 8128 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Video Moment Retrieval, Temporal Grounding, Dataset Construction, Query Generation, Timestamped Comments, Comment Filtering, Modality Gating
Abstract: Video Moment Retrieval (VMR) aims to identify the temporal moment in a video that corresponds to a user query. Most existing VMR datasets are constructed by randomly selecting temporal moments and generating queries from the corresponding visual and auditory content. We find that this process often produces moments of limited importance and queries that resemble captions rather than user-driven searches. To address these limitations, we propose TCVP, a practical pipeline for VMR dataset generation that leverages timestamped YouTube comments to identify interesting moments and reflect actual search intent. Using YouTube comments naively introduces several challenges: many comments are uninformative (e.g., “07:22 lol”), and comments may refer to different modalities, which requires modality-aware handling. Our pipeline addresses these challenges with two key methodological components, comment filtering and modality gating. Our qualitative analysis shows that users prefer our dataset by a substantial margin (i.e., 70%) over existing baselines. Moreover, benchmarking models on our dataset highlights limitations of current VMR methods and offers insights for future work.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, video processing, speech and vision, human evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 8128
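
Illustrative sketch (not from the paper): the abstract names comment filtering as a pipeline component but does not specify how it works. The Python below shows one way a timestamped comment such as “07:22 lol” could be parsed and screened, assuming a simple rule-based heuristic. The names `parse_comment`, `is_informative`, the `FILLER` stop-list, and the `min_tokens` threshold are all hypothetical; the actual TCVP filter may differ substantially (for example, it could be model-based).

```python
import re

# Matches m:ss, mm:ss, or h:mm:ss timestamps as they commonly appear in comments.
TIMESTAMP_RE = re.compile(r"\b(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\b")

# Hypothetical stop-list of low-information reaction words (an assumption,
# not taken from the paper).
FILLER = {"lol", "lmao", "omg", "wow", "haha", "this", "here"}


def parse_comment(text):
    """Return (offset_in_seconds, remaining_text) for the first timestamp,
    or None when the comment contains no timestamp."""
    match = TIMESTAMP_RE.search(text)
    if match is None:
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2))
    seconds = int(match.group(3))
    # Drop the timestamp itself and keep the surrounding words.
    rest = (text[:match.start()] + " " + text[match.end():]).strip()
    return hours * 3600 + minutes * 60 + seconds, rest


def is_informative(rest, min_tokens=3):
    """Heuristic (assumed): keep a comment only if enough non-filler
    words remain once the timestamp has been removed."""
    tokens = [t for t in re.findall(r"[a-z']+", rest.lower()) if t not in FILLER]
    return len(tokens) >= min_tokens


if __name__ == "__main__":
    for comment in ["07:22 lol", "12:05 when the drummer drops his stick"]:
        parsed = parse_comment(comment)
        if parsed is not None:
            offset, rest = parsed
            print(offset, repr(rest), is_informative(rest))
```

Under this heuristic, “07:22 lol” is discarded because nothing but a reaction word remains after the timestamp is removed, while a comment that describes what happens at the timestamp is kept as a candidate query.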