Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling

21 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Video Understanding; Modality alignment; Representation decoupling; Fine-grained
Abstract: This paper proposes Video-Teller, a video-text foundation model that uses multimodal fusion and fine-grained modality alignment to significantly improve cross-modal generation. Video-Teller improves training efficiency by keeping its pre-trained vision and language modules frozen. It also draws on the robust linguistic capabilities of large language models to generate more nuanced descriptions of videos (video summaries). To integrate visual and auditory information and improve the model's understanding of videos, Video-Teller employs a cascaded Q-Former that fuses information across frames and modalities. In addition to conventional loss functions, we introduce a Text Auto-Encoder that decouples the target text for fine-grained modality alignment, further optimizing the model. Experimental results show that the proposed video foundation model comprehends videos accurately and generates coherent, precise language descriptions. Notably, the fine-grained alignment improves the model's capabilities at a relatively small additional cost.
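The cascaded fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the cascade, so the two-stage layout, the function names (`cross_attend`, `cascaded_fusion`), the token counts, and the use of projection-free single-head attention are all assumptions made for illustration. Stage 1 lets a small set of query vectors attend to each frame's features; stage 2 lets a second set of queries attend to the stage-1 outputs of all frames together with audio features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, inputs):
    """Single-head cross-attention with no learned projections (assumption).

    Each query is replaced by a softmax-weighted average of the input vectors.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, x)) / math.sqrt(dim) for x in inputs]
        weights = softmax(scores)
        outputs.append(
            [sum(w * x[d] for w, x in zip(weights, inputs)) for d in range(dim)]
        )
    return outputs


def cascaded_fusion(frame_feats, audio_feats, frame_queries, fusion_queries):
    """Hypothetical two-stage cascade: per-frame compression, then joint fusion."""
    # Stage 1: compress every frame into a fixed number of query tokens.
    per_frame = [cross_attend(frame_queries, feats) for feats in frame_feats]
    # Stage 2: fuse tokens from all frames together with the audio tokens.
    pooled = [token for frame in per_frame for token in frame] + audio_feats
    return cross_attend(fusion_queries, pooled)


def random_tokens(rng, count, dim):
    """Random stand-ins for frozen-encoder features and learned queries."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim = 8
frames = [random_tokens(rng, 16, dim) for _ in range(4)]  # 4 frames, 16 tokens each
audio = random_tokens(rng, 10, dim)                       # 10 audio tokens
fused = cascaded_fusion(
    frames, audio, random_tokens(rng, 3, dim), random_tokens(rng, 5, dim)
)
print(len(fused), len(fused[0]))  # 5 8
```

In this sketch the output length depends only on the number of fusion queries, not on the number of frames or audio tokens, which is the usual motivation for query-based fusion in front of a frozen language model.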
Primary Area: representation learning for computer vision, audio, language, and other modalities
Submission Number: 3062