PC-Net: Weakly Supervised Compositional Moment Retrieval via Proposal-Centric Network

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Video understanding, multimodal alignment, contrastive learning
TL;DR: To alleviate the existing problems of precise-timestamp dependence and limited query generalization, we propose a weakly supervised compositional moment retrieval task and construct an effective baseline to address its challenges.
Abstract: With the exponential growth of video content, video moment retrieval (VMR), which localizes relevant video moments based on natural language queries, has gained significant attention. Existing weakly supervised VMR methods focus on designing various feature modeling and modal interaction modules to alleviate the reliance on precise temporal annotations. However, these methods generalize poorly to compositional queries with novel syntactic structures or vocabulary in real-world scenarios. To this end, we propose a new task: weakly supervised compositional moment retrieval (WSCMR). This task trains models using only video-query pairs without precise temporal annotations, while requiring generalization to complex compositional queries. Furthermore, we propose a proposal-centric network (PC-Net) to tackle this challenging task. First, video and query features are extracted by frozen feature extractors, followed by modality interaction to obtain multimodal features. Second, to handle compositional queries with explicit temporal associations, a dual-granularity proposal generator decodes multimodal global and frame-level features into query-relevant proposal boundaries with fine-grained temporal perception. Third, to improve the discrimination of proposal features, a proposal feature aggregator conducts semantic alignment between frames and queries and employs a learnable peak-aware Gaussian distributor that fits frame weights within each proposal to derive proposal features from the video frame features. Finally, proposal quality is assessed by reconstructing the masked query from the obtained proposal features.
To further enhance the model's ability to capture semantic associations between proposals and queries, a quality margin regularizer dynamically stratifies proposals into high- and low-query-relevance subsets, strengthening the association between queries and common elements within proposals while suppressing spurious correlations via inter-subset contrastive learning. Notably, PC-Net achieves superior performance with 54% fewer parameters than prior works thanks to its parameter-efficient design. Experiments on Charades-CG and ActivityNet-CG demonstrate PC-Net's ability to generalize across diverse compositional queries. Code is available at https://github.com/mingyao1120/PC-Net.
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 3559