GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have constituted a significant leap forward in the field, particularly in the processing of videos, which poses inherent challenges such as modeling spatiotemporal relationships. However, existing MLLMs focus predominantly on the comprehension of video inputs, with limited capability to generate video content. In this paper, we present GPT4Video, a unified, lightweight framework that seamlessly integrates LLMs, visual feature extractors, and Stable Diffusion-based generative models for cohesive video understanding and generation. Moreover, we propose a text-only finetuning approach that equips models for instruction-following and safeguarding in multimodal conversations without requiring costly annotated video-based instructions. Additionally, we construct multi-turn, caption-interleaved datasets for finetuning and benchmarking MLLMs, which serve as solid resources for advancing this field. Through quantitative and qualitative assessments, GPT4Video demonstrates the following advantages: 1) The framework incorporates video generation ability without adding extra training parameters, ensuring seamless compatibility with various video generators. 2) The model achieves superior performance across a variety of benchmarks. For instance, it outperforms Valley by 11.8% on video question answering and surpasses NExT-GPT by 2.3% on text-to-video generation. 3) Pioneering safety in open-source MLLMs, we develop finetuning and evaluation datasets, achieving an F1 score exceeding 80% in blocking harmful content during video understanding and generation. In general, GPT4Video shows potential to function as a real-life assistant, marked by its effectiveness, adaptability, and safety. We will open-source our code, data, and models.
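To make the abstract's architecture concrete, the sketch below illustrates one plausible way such a unified understand-and-generate pipeline could be wired together: the LLM consumes visual features for understanding and, when asked to create a video, emits a text prompt that is routed to an off-the-shelf text-to-video generator, so no extra trainable parameters are needed on the generation side. This is a minimal sketch under assumptions, not the authors' released implementation; the tag format, function names, and component interfaces (visual_encoder, llm, t2v_generator, safety_filter) are all hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the authors' code) of a unified
# understand-and-generate loop in the spirit of GPT4Video.
import re
from typing import List, Optional

# Hypothetical marker the LLM might use to delimit a generation request.
VIDEO_TAG = re.compile(r"<video>(.*?)</video>", re.DOTALL)


def respond(user_text: str,
            video_frames: Optional[List[object]],
            visual_encoder,   # e.g., a frozen vision backbone (assumption)
            llm,              # chat LLM exposing .generate(text, visual_features=...) (assumption)
            t2v_generator,    # any text-to-video model, e.g., Stable Diffusion based (assumption)
            safety_filter):   # classifier flagging harmful prompts (assumption)
    """Understand an (optional) input video and, if asked, generate a new one."""
    features = visual_encoder(video_frames) if video_frames else None
    reply = llm.generate(user_text, visual_features=features)

    match = VIDEO_TAG.search(reply)
    if match:
        prompt = match.group(1).strip()
        if safety_filter.is_harmful(prompt):
            # Block unsafe requests instead of calling the generator.
            return reply.replace(match.group(0), "[request declined for safety]"), None
        generated_video = t2v_generator(prompt)  # generator is swappable
        return VIDEO_TAG.sub("[generated video]", reply), generated_video

    return reply, None
```

Because the generator only ever sees a text prompt, swapping in a different text-to-video backend would not require retraining the language model, which is consistent with the compatibility claim in the abstract.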
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work introduces GPT4Video, a framework that advances multimedia/multimodal processing by integrating a visual feature extractor and a Stable Diffusion-based generative model with large language models (LLMs). Unlike previous models focused solely on multimodal input comprehension, GPT4Video extends capabilities to both understanding and generating video content. The innovation lies in its text-only finetuning approach and the development of specialized finetuning and evaluation datasets, which circumvent the need for costly multimodal annotations. GPT4Video demonstrates strong performance, outperforming prior models on established benchmarks for video question answering and text-to-video generation, while also pioneering safety in multimodal large language models by achieving an F1 score over 80% in filtering harmful content. The decision to open-source the code, data, and models not only facilitates further research and development in the field but also ensures GPT4Video's accessibility and adaptability.
Supplementary Material: zip
Submission Number: 4102