On Path to Multimodal Generalist: General-Level and General-Bench

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 oral, License: CC BY-NC-ND 4.0
Abstract: The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike their specialist predecessors, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding, and from supporting single modalities to accommodating a wide array of, or even arbitrary, modalities. To assess the capabilities of various MLLMs, a diverse array of benchmark test sets has been proposed. This leads to a critical question: *Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI?* We argue that the answer is not as straightforward as it seems. In this project, we introduce an evaluation framework to delineate the capabilities and behaviors of current multimodal generalists. This framework, named **General-Level**, establishes five levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI (Artificial General Intelligence). Central to our framework is the use of **Synergy** as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, **General-Bench**, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances.
The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project Page: https://generalist.top/, Leaderboard: https://generalist.top/leaderboard/, Benchmark: https://huggingface.co/General-Level/.
Lay Summary: Artificial intelligence (AI) systems are increasingly capable of handling diverse types of data, such as text, images, and audio. However, many of these systems excel only at specific tasks or data types and lack the broad adaptability seen in human intelligence. In addition, the prevailing evaluation paradigm, which simply assumes that higher performance across tasks indicates a stronger model, can be misleading. Our research introduces two tools: General-Level, a framework that assesses an AI model's ability to integrate and apply knowledge across different tasks and data types; and General-Bench, a comprehensive dataset comprising over 700 tasks and 325,800 examples designed to evaluate this integrative capability. By applying these tools to over 100 existing AI models, we found that while some models perform well on individual tasks, they often struggle to transfer knowledge between different types of tasks or data. This indicates a gap that must be closed before truly general-purpose multimodal AI can be achieved. Our work aims to guide the development of more versatile AI systems that can seamlessly understand and generate multiple forms of data, moving us closer to AI with human-like general intelligence.
Link To Code: https://generalist.top/
Primary Area: Deep Learning->Foundation Models
Keywords: Large Language Model, Multimodal Large Language Model, Multimodal Generalist, Evaluation, Benchmark
Submission Number: 2912