Hierarchical Debiasing and Noisy Correction for Cross-domain Video Tube Retrieval

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, Readers: Everyone, License: CC BY 4.0
Abstract: Video Tube Retrieval (VTR) has attracted wide attention in the multi-modal domain. It aims to accurately localize the spatial-temporal tube in a video based on a natural language description. Despite remarkable progress, existing VTR models trained on a specific domain (the source domain) often perform unsatisfactorily in another domain (the target domain) because of the domain gap. To address this issue, we introduce the Unsupervised Domain Adaptation learning strategy into the VTR task (UDA-VTR), which enables knowledge transfer from the labeled source domain to the unlabeled target domain without additional manual annotations. An intuitive solution is to generate pseudo labels for the target-domain samples with the fully trained source model and then fine-tune the source model on the target domain with those pseudo labels. However, the domain gap gives rise to two problems in this process: (1) transferring model parameters across domains may introduce source-domain bias into target-domain features, significantly affecting the feature-based prediction for target-domain samples; (2) the pseudo labels tend to identify video tubes that are widely present in the source domain, rather than accurately localizing the correct video tubes specific to the target-domain samples. To address these issues, we propose an unsupervised domain adaptation model via Hierarchical dEbiAsing and noisy correction for cRoss-domain video Tube retrieval (HEART), which contains two characteristic modules: Layered Feature Debiasing (including adversarial feature alignment and graph-based alignment) and Pseudo Label Refinement. Extensive experiments demonstrate the effectiveness of our HEART model, which significantly surpasses state-of-the-art methods. The code is available (https://anonymous.4open.science/r/HEART).
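The intuitive pseudo-labeling baseline described in the abstract can be sketched as follows. This is a minimal illustration, not the HEART implementation: the tube layout (one box per frame index), the function names `generate_pseudo_labels` and `ema_update`, and the exponential-moving-average teacher update are all assumptions made for the example. The abstract states only that pseudo labels come from the fully trained source model and are then used for fine-tuning.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed layout: a tube maps a frame index to one box (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def generate_pseudo_labels(
    source_model: Callable[[str, str], Tube],
    target_samples: Sequence[Tuple[str, str]],
) -> List[Tuple[str, str, Tube]]:
    """Run the source-trained model on unlabeled (video_id, query) pairs.

    The predicted tubes serve as pseudo labels for fine-tuning on the
    target domain; they may carry source-domain bias.
    """
    return [(video, query, source_model(video, query))
            for video, query in target_samples]


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    momentum: float = 0.99,
) -> Dict[str, float]:
    """Hypothetical teacher update: move each teacher weight toward the student.

    An exponential moving average is one common choice in teacher-student
    setups; the source does not say which update rule HEART uses.
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}
```

In this sketch the model is any callable that maps a video identifier and a query to a tube, so a trained network or a stub can be plugged in without changing the surrounding logic.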
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: To the best of our knowledge, this work is an early exploration of unsupervised domain adaptation for the video tube retrieval task. To address this task, we propose a novel model, HEART, based on the teacher-student framework.
Supplementary Material: zip
Submission Number: 5176