Keywords: Multimodal LLM, Comprehension and Generation
Abstract: The rapid evolution of multimodal foundation models has showcased remarkable capabilities in vision-language understanding and generation, yielding impressive results on academic benchmarks. However, a gap remains between these models and real-world applicability, primarily because of their limited capacity to respond effectively to diverse user instructions and to interact with varied visual data. This limitation stems from the fundamental challenge of modeling multi-granularity visual semantics for both comprehension and generation tasks. In this paper, we take a pioneering step towards applying multimodal foundation models in an open-world context and present a unified and versatile foundation model, namely, $\textbf{SEED-X}$. As the first of its kind, SEED-X seamlessly integrates two essential features: (1) comprehending images of arbitrary sizes and aspect ratios, and (2) enabling multi-granularity image generation.
Beyond competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains. We hope that our work will inspire future research into what versatile multimodal foundation models can achieve in real-world applications. All models and all training and inference code are available at https://anonymous.4open.science/r/SEED-X/.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2836