Interactive Multi-event Video Retrieval with Context Integration and Position Constraint

07 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Interactive Retrieval, Text-to-Video Retrieval, Multi-event Video Retrieval
Abstract: Interactive video retrieval aims to progressively refine queries through multi-round interactions between the user and the system. Existing methods focus on pre-trimmed videos paired with captions that describe the gist of the video content well. In real-world scenarios, however, videos typically contain a sequence of unrelated and discontinuous events, while a query usually refers to a single event. This mismatch introduces significant challenges, including sensitivity to irrelevant content, lack of context exploitation, and insufficient position exploration. Motivated by this, we propose **CIPC**, an interactive video retrieval framework with Context Integration and Position Constraint tailored to multi-event videos. CIPC adaptively segments videos into event-consistent units, supports progressive interactions that exploit contextual information, and incorporates a position constraint that re-weights candidate segments by temporal distance, promoting better temporal alignment with the query. Extensive experiments and a user simulation study demonstrate the effectiveness and robustness of our approach, yielding 4.1%–6.7% R@1 improvements on three widely used benchmarks.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2748