Composed Query-Based Event Retrieval in Video Corpus with Multimodal Episodic Perceptron

Published: 2025 · Last Modified: 12 Nov 2025 · ICMR 2025 · CC BY-SA 4.0
Abstract: Event retrieval searches for specific events in galleries of untrimmed videos and has garnered significant attention in recent years. However, most existing works follow a purely text-based video retrieval paradigm, which suffers from two main drawbacks: (1) the episodic information conveyed by described events is not fully perceived, so retrieval performance degrades when query intentions vary; and (2) current models are prone to returning semantically similar false positives, since a plain text query can rarely describe the target video content precisely. In this paper, we propose a novel event retrieval framework termed Composed Query-Based Event Retrieval (CQBER). Specifically, we first construct two CQBER benchmark datasets, ActivityNet-CQ and TVR-CQ, which cover open-world scenarios and TV shows, respectively. We then propose an initial CQBER method, termed Multimodal Episodic Perceptron (MEP), which mines complete query semantics from both observed static visual cues and varied textual descriptions. Extensive experiments demonstrate that the proposed framework significantly boosts event retrieval accuracy across a range of existing methods. Our code and datasets are available at https://github.com/VincentVanNF/CQBER.
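The abstract does not detail MEP's architecture, but the composed-query setting it describes pairs a visual cue with a textual description and ranks gallery videos against the fused query. The sketch below illustrates that general pattern only, under assumed pre-computed embeddings; `compose_query`, the weighted-sum fusion, and the weight `alpha` are hypothetical placeholders for illustration, not the authors' method.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def compose_query(img_emb: np.ndarray, txt_emb: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Fuse a reference-image embedding with a text embedding into one
    composed query. The weighted sum here is a stand-in for a learned
    multimodal fusion module such as MEP (assumption, not the paper's design)."""
    return l2_normalize(alpha * img_emb + (1.0 - alpha) * txt_emb)

def rank_videos(query: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending cosine similarity."""
    sims = video_embs @ query  # shape: (num_videos,)
    return np.argsort(-sims)

# Toy example: 512-d embeddings for one query pair and a 100-video gallery.
rng = np.random.default_rng(0)
img_emb = l2_normalize(rng.standard_normal(512))
txt_emb = l2_normalize(rng.standard_normal(512))
video_embs = l2_normalize(rng.standard_normal((100, 512)))

query = compose_query(img_emb, txt_emb)
print(rank_videos(query, video_embs)[:5])  # ids of the top-5 retrieved videos
```

In a trained system the fixed weighted sum would be replaced by a learned fusion network, which is presumably the role the Multimodal Episodic Perceptron plays in the paper.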