Keywords: Video Understanding
Abstract: Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags, automatically extracted from foundation models, to enhance video retrieval.
Prior work has proposed emulating human reasoning by introducing latent concepts derived from the features of a video and its corresponding caption. Building on these efforts to align latent concepts across both modalities, we propose learning auxiliary concepts from modality-specific tags.
We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts and to make the individual concepts easier to distinguish from one another.
To strengthen the alignment between visual and textual latent concepts, where a set of visual concepts matches a corresponding set of textual concepts, we introduce an Alignment Loss. This loss aligns the proposed auxiliary concepts with each modality's latent concepts, enhancing the model's ability to match videos with their appropriate captions.
We conduct extensive experiments on three diverse datasets: MSR-VTT, DiDeMo, and ActivityNet Captions. The experimental results consistently demonstrate that modality-specific tags significantly improve cross-modal alignment, achieving performance comparable to current state-of-the-art methods.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3585