PATHS: Parameter-wise Adaptive Two-Stage Training Harnessing Scene Transition Mask Adapters for Video Retrieval

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: text to video retrieval, transfer learning, adapter, CLIP
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Image-text pre-trained models, e.g., CLIP, have gained significant traction even in the field of video-text learning. Recent approaches extend CLIP to video tasks and have achieved unprecedented performance on the foundational task of video understanding: text-video retrieval. However, unlike conventional transfer learning within the same domain, transferring across modalities from images to videos often requires fine-tuning all of the pre-trained weights rather than keeping them frozen. This can lead to overfitting and distortion of the pre-trained weights, degrading performance. To address this challenge, we introduce a learning strategy termed Parameter-wise Adaptive Two-stage training Harnessing Scene transition mask adapter (PATHS). Our two-stage learning process alleviates deviation of the pre-trained weights. In the first stage, a novel method efficiently narrows the search for optimal weights down to strong candidates by monitoring only the fluctuations of the parameters. Once the parameters are fixed to their optimal values, the second stage is dedicated to acquiring knowledge of scenes with an adapter module. PATHS can be applied to any existing model in a plug-and-play manner and consistently improves performance over the base models. We report state-of-the-art performance across key text-video benchmark datasets, including MSRVTT and LSMDC. Our code is available at https://anonymous.4open.science/r/PATHS_.
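To make the two-stage idea in the abstract concrete, below is a minimal PyTorch-style sketch. It assumes one possible reading of the method: stage 1 briefly fine-tunes the backbone while tracking how much the parameters fluctuate between checkpoints and keeps the most stable checkpoint as the "optimal" weights; stage 2 freezes those weights and trains only an adapter module. The helper names (`parameter_fluctuation`, `SceneAdapter`), the fluctuation criterion, the dummy losses, and all hyperparameters are illustrative assumptions, not the authors' exact method.

```python
# Hedged sketch of a PATHS-style two-stage schedule (assumed details).
import copy
import torch
import torch.nn as nn


def parameter_fluctuation(model, previous):
    """Mean absolute change of all parameters since the previous snapshot
    (assumed fluctuation measure; the paper's criterion may differ)."""
    return sum(
        (p.detach() - previous[n]).abs().mean().item()
        for n, p in model.named_parameters()
    ) / len(previous)


class SceneAdapter(nn.Module):
    """Placeholder bottleneck adapter standing in for the scene transition
    mask adapter described in the abstract."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, hidden), nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))


# Stand-in backbone; in practice this would be a CLIP-based retrieval model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Stage 1: probe fine-tuning, keep the checkpoint with the smallest fluctuation.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
prev = {n: p.detach().clone() for n, p in model.named_parameters()}
best_state, best_fluct = copy.deepcopy(model.state_dict()), float("inf")
for epoch in range(5):
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy retrieval loss
    opt.zero_grad(); loss.backward(); opt.step()
    fluct = parameter_fluctuation(model, prev)
    if fluct < best_fluct:
        best_fluct, best_state = fluct, copy.deepcopy(model.state_dict())
    prev = {n: p.detach().clone() for n, p in model.named_parameters()}

# Fix the backbone to the selected weights and freeze it.
model.load_state_dict(best_state)
for p in model.parameters():
    p.requires_grad = False

# Stage 2: train only the adapter on top of the frozen backbone.
adapter = SceneAdapter(dim=512)
opt2 = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
for epoch in range(5):
    feats = model(torch.randn(8, 512))           # frozen backbone features
    loss = adapter(feats).pow(2).mean()          # dummy adapter objective
    opt2.zero_grad(); loss.backward(); opt2.step()
```

In this reading, only the fluctuation-selected checkpoint and the small adapter carry task-specific learning, which is what would let the scheme attach to existing CLIP-based retrieval models in a plug-and-play manner.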
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2484