Keywords: video moment retrieval, diverse query types, model versatility
Abstract: With the growing demand for video content understanding and editing, video moment retrieval (VMR) is becoming increasingly important, requiring models that can accurately correlate video content with textual queries. The effectiveness of prevailing VMR models, however, is often compromised by their reliance on training-data biases, which significantly hampers their generalization to out-of-distribution (OOD) content. This challenge underscores the need for approaches that balance leveraging in-distribution (ID) data for learning with maintaining robustness to OOD variations. To address this need, we introduce Reflective Knowledge Distillation (RefKD), a comprehensive training methodology that integrates two complementary processes: Introspective Learning and Extrospective Adjustment. This methodology refines the model's ability to internalize and apply learned correlations in a manner that is both contextually relevant and resilient to bias-induced distortions. By employing a dual-teacher framework, RefKD encapsulates and contrasts the distinct bias perspectives present in VMR datasets, establishing a dynamic, reflective learning dialogue with the student model. This interaction is structured to encourage the student to introspect on learned biases and to adaptively recalibrate its learning focus as content distributions evolve. Through this reflective learning process, the model develops a more nuanced and comprehensive understanding of content-query correlations, significantly enhancing its performance in both ID and OOD scenarios. Extensive evaluations across several standard VMR benchmarks demonstrate the efficacy of RefKD.
Our methodology not only matches the OOD performance of existing debiasing methods but also, in many cases, significantly surpasses their ID performance. By effectively bridging the gap between ID and OOD learning, RefKD sets a new standard for VMR systems that are not only better at understanding and interpreting video content in a variety of contexts but also more equitable and reliable across diverse operational scenarios. This work both advances VMR technology and paves the way for future research on bias-aware, robust multimedia content analysis.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1720