MAMS: MODEL-AGNOSTIC MODULE SELECTION FRAMEWORK FOR VIDEO CAPTIONING

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Video, Video Captioning, Multi-modal, model-agnostic, module selector, token selector
TL;DR: A model-agnostic module selection framework for video captioning that mitigates the information loss and redundancy caused by extracting a fixed number of frames.
Abstract: Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal methods extract a fixed number of frames, which introduces critical challenges. When too few frames are extracted, the model cannot retrieve sufficient information for caption generation; when too many are extracted, the frames contain largely redundant information. We refer to these challenges as information loss and excessive information similarity, respectively. This paper proposes a new model-agnostic module selection framework that chooses a module of appropriate size through a flow selector and a token selector. The proposed framework selects an appropriate feature size for each video during both training and inference, thereby moderating the information loss and excessive information similarity that arise from extracting a fixed number of frames. In addition, we further reduce excessive information similarity within each flow by adding diversity-promoting losses. Numerical experiments on two datasets demonstrate that the proposed framework significantly improves the performance of three representative state-of-the-art video captioning models.
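A minimal sketch of one plausible reading of the framework described in the abstract, assuming a selector that picks a per-video feature budget and a token selector that keeps the corresponding subset of frame tokens, with a pairwise-similarity penalty as the diversity-promoting loss. All names (TokenSelector, ModuleSelector, diversity_loss, keep_ratios) are illustrative assumptions, not identifiers from the paper, and the authors' actual design may differ.

```python
# Hypothetical illustration of a module/token selection wrapper around
# fixed frame features; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSelector(nn.Module):
    """Scores frame tokens and keeps a video-dependent subset (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        # tokens: (B, N, D) frame features from a fixed extractor
        scores = self.scorer(tokens).squeeze(-1)            # (B, N)
        k = max(1, int(tokens.size(1) * keep_ratio))
        idx = scores.topk(k, dim=1).indices                 # keep top-k tokens
        return torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )


class ModuleSelector(nn.Module):
    """Picks a per-video keep ratio (a 'module size') from a small candidate set."""

    def __init__(self, dim: int, keep_ratios=(0.25, 0.5, 1.0)):
        super().__init__()
        self.keep_ratios = keep_ratios
        self.gate = nn.Linear(dim, len(keep_ratios))

    def forward(self, tokens: torch.Tensor) -> float:
        logits = self.gate(tokens.mean(dim=1))              # pooled video descriptor
        choice = logits.argmax(dim=-1)                      # hard choice at inference
        # For simplicity, use the first video's choice; a real system
        # would route each video to its own module size.
        return self.keep_ratios[int(choice[0])]


def diversity_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Penalizes pairwise similarity among selected tokens; one common way to
    encourage diversity (the paper's exact losses may differ)."""
    z = F.normalize(tokens, dim=-1)                         # (B, K, D)
    sim = z @ z.transpose(1, 2)                             # (B, K, K)
    off_diag = sim - torch.eye(sim.size(-1), device=sim.device)
    return off_diag.pow(2).mean()


if __name__ == "__main__":
    feats = torch.randn(2, 64, 512)                         # 64 frame tokens per video
    selector, token_sel = ModuleSelector(512), TokenSelector(512)
    ratio = selector(feats)
    kept = token_sel(feats, ratio)
    print(kept.shape, diversity_loss(kept).item())
```

In such a setup, the selected tokens would then be fed to the downstream captioning backbone, which is why the framework can remain model-agnostic: the backbone itself is untouched and only its input features vary in size per video.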
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6438