A Needle In A Haystack: Referring Hour-Level Video Object Segmentation

01 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Referring Video Object Segmentation, Hierarchical Memory
Abstract: Long videos spanning minutes to hours are ubiquitous in daily life, yet existing Referring Video Object Segmentation (RVOS) datasets are limited to short videos of only 5-60 seconds. To expose the difficulty of referring object segmentation on hour-level videos, we construct the first Hour-level Referring Video Object Segmentation (Hour-RVOS) dataset, characterized by (1) any-length videos from seconds to hours, (2) rich-semantic expressions roughly double the length of existing ones, and (3) multi-round interactions that follow target changes. These characteristics bring tough challenges: (1) **Sparse object distribution**: segmenting target objects in sparsely distributed key frames among massive numbers of frames is like finding a needle in a haystack. (2) **Long-range correspondence**: intricate linguistic-visual associations must be established across thousands of frames. To address these challenges, we propose a semi-online hierarchical-memory-association RVOS method that builds cross-modal long-range correlations. Through interleaved propagation of hierarchical memory and dynamic balancing of linguistic-visual tokens, our method adequately associates multi-period representations of target objects in real time. The benchmark results show that existing offline methods struggle with hour-level videos even across multiple stages, whereas our method, without LLMs, achieves over $15\%$ accuracy improvement compared to Sa2VA-8B when handling any-length videos with multi-round, various-semantic expressions in a single stage.
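The abstract's "interleaved propagation of hierarchical memory" can be pictured as a two-level store: a dense short-term window for recent frames plus a sparse long-term bank for distant context. The sketch below is a minimal illustration of that general idea only; the class and parameter names (`HierarchicalMemory`, `short_capacity`, `long_stride`) are hypothetical and do not come from the paper.

```python
from collections import deque

class HierarchicalMemory:
    """Illustrative two-level memory: a dense short-term window plus a
    sparse long-term store (an assumed sketch, not the paper's design).

    Every frame feature enters the short-term deque; when the window
    overflows, the evicted feature is thinned by `long_stride` before
    entering long-term memory, so reads cover both recent detail and
    coarse long-range context.
    """

    def __init__(self, short_capacity=4, long_stride=8):
        self.short = deque(maxlen=short_capacity)   # dense recent frames
        self.long = []                              # sparse distant frames
        self.long_stride = long_stride
        self._seen = 0                              # frames written so far

    def write(self, frame_feature):
        # Interleaved propagation: the frame about to be evicted from the
        # short-term window is subsampled into long-term memory.
        if len(self.short) == self.short.maxlen:
            evicted = self.short[0]
            if self._seen % self.long_stride == 0:
                self.long.append(evicted)
        self.short.append(frame_feature)            # deque auto-evicts
        self._seen += 1

    def read(self):
        # A query sees long-range context first, then recent detail.
        return list(self.long) + list(self.short)

mem = HierarchicalMemory(short_capacity=4, long_stride=8)
for t in range(100):
    mem.write(t)  # stand-in for a per-frame feature embedding

print(list(mem.short))  # dense: last 4 frames
print(len(mem.long))    # sparse: subsampled distant frames
```

The hour-level setting motivates the split: keeping every frame feature is infeasible across thousands of frames, so only a strided subset survives into long-term memory while recent frames stay at full density.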
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 63