Chat-UniVi: A Unified Vision-Language Model for Image and Video Understanding

20 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: vision and language, large language models, image and video understanding
TL;DR: We introduce Chat-UniVi, a unified multimodal large language model that represents images and videos using a collection of dynamic visual tokens.
Abstract: Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to multimodal conversations. In this study, we introduce Chat-UniVi, a unified vision-language model capable of comprehending and engaging in conversations involving images and videos. Specifically, Chat-UniVi uniformly represents images and videos using a collection of dynamic visual tokens. This novel representation framework enables the model to use a limited number of visual tokens to capture both the spatial details necessary for images and the comprehensive temporal relationships required for videos. In addition, we leverage a multi-scale representation that enables large language models to perceive both high-level semantic concepts and low-level visual details. Moreover, Chat-UniVi is trained on a mixed dataset containing both images and videos, making it directly applicable to tasks involving both media without any modifications. Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently surpasses even existing methods designed exclusively for either images or videos. To the best of our knowledge, Chat-UniVi is the first successful unified multimodal large language model that consistently outperforms both dedicated image and video models.
Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2201