Keywords: Referring Video Object Segmentation, Hierarchical Memory
Abstract: Long videos lasting minutes or more are ubiquitous in daily life, whereas existing Referring Video Object Segmentation (RVOS) datasets are limited to short videos of only 5-60 seconds.
To tackle referring object segmentation on hour-level videos, we construct the first Hour-level Referring Video Object Segmentation (Hour-RVOS) dataset, characterized by
(1) videos of any length, from seconds to hours, (2) semantically rich expressions roughly twice the length of those in existing datasets, and (3) multi-round interactions that follow target changes.
These characteristics in turn pose tough challenges: (1) **Sparse object distribution**: segmenting target objects in sparsely distributed key frames among massive numbers of frames is like finding a needle in a haystack. (2) **Long-range correspondence**: intricate linguistic-visual associations must be established across thousands of frames.
To address these challenges, we propose a semi-online RVOS method based on hierarchical memory association that builds long-range cross-modal correlations.
Through interleaved propagation of hierarchical memory and dynamic balancing of linguistic and visual tokens, our method associates multi-period representations of target objects in real time.
Benchmark results show that existing offline methods struggle with hour-level videos and require multiple stages, whereas our proposed method, which uses no LLMs, achieves over $15\%$ accuracy improvement compared to Sa2VA-8B when handling any-length videos with multi-round, varied-semantic expressions in a single stage.
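To make the hierarchical-memory idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of a two-level memory bank for streaming long videos: a dense short-term buffer of recent frame features and a sparse long-term store built by periodically consolidating that buffer. All names and parameters (`short_capacity`, `consolidate_every`, mean-pooling as the consolidation rule) are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical two-level (hierarchical) memory bank for hour-level video streams.
# Assumptions: per-frame features are fixed-size vectors; consolidation is mean pooling.
from collections import deque
import numpy as np


class HierarchicalMemory:
    """Dense short-term memory of recent frames plus a sparse long-term memory
    built by periodically compressing the short-term buffer, so memory cost
    stays bounded as the video grows to hours."""

    def __init__(self, short_capacity=8, consolidate_every=8):
        self.short = deque(maxlen=short_capacity)   # recent frame features
        self.long = []                              # compressed long-range summaries
        self.consolidate_every = consolidate_every
        self._since_consolidation = 0

    def update(self, frame_feature: np.ndarray) -> None:
        """Add one frame feature; occasionally fold the buffer into long-term memory."""
        self.short.append(frame_feature)
        self._since_consolidation += 1
        if self._since_consolidation >= self.consolidate_every:
            self.long.append(np.mean(np.stack(list(self.short)), axis=0))
            self._since_consolidation = 0

    def read(self) -> np.ndarray:
        """Return long-range summaries concatenated with recent detail,
        e.g. as keys/values for cross-modal attention."""
        items = self.long + list(self.short)
        return np.stack(items)


# Usage: stream per-frame features and query the memory at any point.
memory = HierarchicalMemory()
for t in range(100):
    memory.update(np.random.randn(256))   # stand-in for a frame embedding
print(memory.read().shape)                # (num_long_slots + num_recent, 256)
```

The design choice illustrated here is only that read cost depends on the number of memory slots rather than the number of elapsed frames; how the actual method propagates and balances linguistic and visual tokens is described in the paper itself.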
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 63