Keywords: Reasoning in Image, Cross-Domain Knowledge Benchmark, Diffusion Models, Knowledge Graph
Abstract: In this paper, we introduce knowledge image generation as a new task, together with the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG), which probes the reasoning capability of image-generation models.
Knowledge images have been central to human civilization and to the mechanisms of human learning—a fact underscored by dual-coding theory and the picture-superiority effect.
Generating such images is challenging: it demands multimodal reasoning that fuses world knowledge with pixel-level grounding to produce clear, explanatory visuals.
To enable comprehensive evaluation, MMMG offers $4,456$ expert-validated knowledge image–prompt pairs spanning $10$ disciplines, $6$ educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps.
To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies.
We further introduce MMMG-Score to evaluate generated knowledge images. The metric combines factual fidelity, measured by the graph-edit distance (GED) between the predicted and reference KGs, with an assessment of visual clarity (see the sketch after the abstract).
Comprehensive evaluations of $21$ state-of-the-art text-to-image generation models expose serious reasoning deficits: low entity fidelity, weakly depicted relations, and visual clutter. Even GPT-4o achieves an MMMG-Score of only $50.20$, underscoring the benchmark's difficulty.
To spur further progress, we release FLUX-Reason (MMMG-Score of $34.45$), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on $16,000$ curated knowledge image–prompt pairs.
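To make the metric concrete, below is a minimal sketch of a GED-based scoring routine. It assumes `networkx` graphs whose nodes carry entity labels and whose edges carry relation labels; the normalization of the edit distance and the multiplicative combination with the clarity score are illustrative assumptions, not the paper's exact formula.

```python
import networkx as nx

def fidelity_score(kg_pred: nx.DiGraph, kg_ref: nx.DiGraph) -> float:
    """Map the graph-edit distance between two KGs to a [0, 1] fidelity score."""
    ged = nx.graph_edit_distance(
        kg_pred,
        kg_ref,
        node_match=lambda a, b: a.get("label") == b.get("label"),
        edge_match=lambda a, b: a.get("relation") == b.get("relation"),
    )
    # Normalize by an upper bound on GED: deleting every node and edge of one
    # graph and inserting every node and edge of the other (unit costs).
    worst = (kg_pred.number_of_nodes() + kg_pred.number_of_edges()
             + kg_ref.number_of_nodes() + kg_ref.number_of_edges())
    return 1.0 - ged / worst if worst else 1.0

def mmmg_like_score(kg_pred: nx.DiGraph, kg_ref: nx.DiGraph, clarity: float) -> float:
    """Combine factual fidelity with a visual-clarity score (both in [0, 1]).

    The multiplicative combination and the 0-100 scale are assumptions made
    for illustration, chosen to match the score ranges reported above.
    """
    return 100.0 * fidelity_score(kg_pred, kg_ref) * clarity
```

Note that exact GED is exponential in graph size, so a real evaluation pipeline would typically bound the search (e.g., via `networkx`'s `timeout` argument to `graph_edit_distance`) or use an approximation.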
Croissant File: json
Dataset URL: https://huggingface.co/datasets/MMMGBench/MMMG
Code URL: https://github.com/MMMGBench/MMMG
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in computer vision
Flagged For Ethics Review: true
Submission Number: 1263
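For reference, a minimal sketch of loading the benchmark from the Dataset URL above, assuming the standard Hugging Face `datasets` API; split and column names are not confirmed here, so the snippet only inspects whatever the repository provides.

```python
from datasets import load_dataset

# Download MMMG from the Hugging Face Hub (see the Dataset URL above).
ds = load_dataset("MMMGBench/MMMG")

# Inspect the splits and columns shipped with the repository.
print(ds)
```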