Two birds with one stone: Query-dependent moment retrieval in muted video or audio via inter-token interactions

Published: 2026 · Last Modified: 07 Jan 2026 · Inf. Sci. 2026 · CC BY-SA 4.0
Abstract: The surge of social media has led to massive growth in audio and video content, increasing the demand for tools that retrieve user-specific highlights: short segments aligned with sentence queries. In real-world scenarios, users often access only one modality, for example listening to an audio companion or watching video on mute, and may switch between them depending on context. However, current methods treat muted-video and audio retrieval as separate tasks and lack a unified structure that supports flexible modality adaptation. We propose Inter-token Aware Retrieval (IAR), a lightweight, unified framework for query-dependent moment retrieval under uni-modal constraints. IAR models audio or muted video and text as token sequences and captures both intra-modal and cross-modal token interactions. It consists of three modules: a Boundary Enhancement module that strengthens token-level contrast between event and background regions, a Multi-Modal Alignment module that enhances token-to-token relevance between the query and a single modality (video or audio), and a Context Convolution module that aggregates local token relationships while preserving temporal continuity. A two-stage 2D map scores candidate moments and selects the best match. IAR achieves strong performance on Charades-STA, ActivityNet-Captions, and AudioGrounding-AMR, and generalizes effectively across muted-video and audio tasks with high computational efficiency.
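
To make the described pipeline concrete, here is a minimal PyTorch sketch of the three-module flow and the 2D moment map as the abstract outlines them. This is an illustrative reconstruction, not the authors' implementation: the module internals (an MLP for boundary enhancement, cross-attention for alignment, a 1D convolution for context), all dimensions, and the span-scoring rule are assumptions.

```python
# Hypothetical sketch of the IAR pipeline from the abstract; module internals,
# dimensions, and wiring are assumptions, not the published implementation.
import torch
import torch.nn as nn


class IARSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Boundary Enhancement: sharpen token-level event-vs-background contrast (assumed form).
        self.boundary = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Multi-Modal Alignment: query-to-modality token-to-token attention (assumed form).
        self.align = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Context Convolution: aggregate local token relationships over time (assumed form).
        self.context = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_tokens, query_tokens):
        # modality_tokens: (B, T, D) clip-level audio or muted-video features
        # query_tokens:    (B, L, D) sentence-query token features
        x = modality_tokens + self.boundary(modality_tokens)
        x, _ = self.align(x, query_tokens, query_tokens)     # cross-modal token interaction
        x = self.context(x.transpose(1, 2)).transpose(1, 2)  # preserve temporal continuity
        # 2D moment map: entry (i, j) scores the candidate span from clip i to clip j.
        T = x.size(1)
        starts = x.unsqueeze(2).expand(-1, T, T, -1)
        ends = x.unsqueeze(1).expand(-1, T, T, -1)
        map2d = self.score(starts + ends).squeeze(-1)        # (B, T, T)
        # Mask out invalid spans where start > end.
        valid = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device))
        return map2d.masked_fill(~valid, float("-inf"))


# Usage: the best-matching moment is the argmax over the valid entries of the map.
model = IARSketch()
clips, query = torch.randn(1, 32, 256), torch.randn(1, 12, 256)
scores = model(clips, query)
best = scores.flatten(1).argmax(-1)
start, end = best // 32, best % 32
print(f"predicted moment: clips {start.item()}..{end.item()}")
```

Because the same token-sequence interface accepts either audio or muted-video features, a single set of weights can, in principle, serve both uni-modal settings, which is the unification the abstract claims; the actual two-stage scoring over the 2D map is not detailed here and is left as a single scoring head in this sketch.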