Abstract: Humans possess the capability to comprehend diverse modalities and seamlessly transfer information between them. In this work, we introduce ModaVerse, a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across various modalities including images, videos, and audio. Predominant MLLM frameworks have largely relied on aligning latent spaces of textual and non-textual features. This alignment process, which synchronizes a language model trained on textual data with encoders and decoders trained on multimodal data, often necessitates extensive training of several projection layers in multiple stages. Inspired by LLM-as-agent methodologies, we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language. It aligns the LLM's output with the input of generative models, avoiding the complexities associated with latent feature alignments, and simplifying the multiple training stages of existing MLLMs into a single, efficient process. By conducting experiments on several benchmarks, we demonstrate that our approach attains comparable performance with the state of the art while achieving considerable efficiencies in data usage. The code is available at https://github.com/xinke-wang/ModaVerse.
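As a rough illustration of what alignment "at the level of natural language" could look like, the sketch below has a language model emit a plain-text instruction that is parsed and handed to a text-conditioned generator. Everything in it is an assumption made for illustration: the JSON instruction format, the `dispatch` function, and the stub generators are hypothetical and are not taken from the paper or its repository.

```python
import json
from typing import Callable, Dict


# Hypothetical stand-ins for text-conditioned generative models.
# A real system would call image, video, or audio generators here.
def fake_image_model(prompt: str) -> str:
    return f"<image generated from: {prompt}>"


def fake_audio_model(prompt: str) -> str:
    return f"<audio generated from: {prompt}>"


# Assumed registry mapping a modality name to a generator.
GENERATORS: Dict[str, Callable[[str], str]] = {
    "image": fake_image_model,
    "audio": fake_audio_model,
}


def dispatch(llm_output: str) -> str:
    """Parse the LLM's text instruction and route it to a generator.

    The instruction format (a JSON object with "modality" and "prompt")
    is an assumption for this sketch, not the format used by ModaVerse.
    """
    instruction = json.loads(llm_output)
    modality = instruction["modality"]
    prompt = instruction["prompt"]
    if modality == "text":
        # Plain text answers need no generator.
        return prompt
    if modality not in GENERATORS:
        raise ValueError(f"unsupported modality: {modality}")
    return GENERATORS[modality](prompt)


if __name__ == "__main__":
    print(dispatch('{"modality": "image", "prompt": "a red kite over a beach"}'))
```

The point of the sketch is only that the interface between the language model and the generators is text, so no projection layers between latent spaces appear anywhere in it.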
External IDs: dblp:conf/cvpr/WangZW24