Keywords: Reasoning in Image, Cross-Domain Knowledge Benchmark, Diffusion Models, Knowledge Graph
Abstract: In this paper, we introduce knowledge image generation as a new task, together with the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG), which probes the reasoning capability of image-generation models.
Knowledge images have been central to human civilization and to the mechanisms of human learning—a fact underscored by dual-coding theory and the picture-superiority effect.
Generating such images is challenging: it demands multimodal reasoning that fuses world knowledge with pixel-level grounding to produce clear, explanatory visuals.
To enable comprehensive evaluation, MMMG offers $4,456$ expert-validated knowledge image–prompt pairs spanning $10$ disciplines, $6$ educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps.
To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies.
We further introduce MMMG-Score to evaluate generated knowledge images. The metric combines factual fidelity, measured by the graph-edit distance (GED) between the predicted and reference KGs, with an assessment of visual clarity (see the sketch after the abstract).
Comprehensive evaluations of $21$ state-of-the-art text-to-image generation models expose serious reasoning deficits: low entity fidelity, weakly depicted relations, and visual clutter. Even GPT-4o achieves an MMMG-Score of only $50.20$, underscoring the benchmark's difficulty.
To spur further progress, we release FLUX-Reason (MMMG-Score of $34.45$), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on $16,000$ curated knowledge image–prompt pairs.
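To make the metric concrete, below is a minimal sketch of a GED-based scoring routine. It assumes `networkx` graphs whose nodes carry entity labels and whose edges carry relation labels; the normalization of the edit distance and the multiplicative combination with the clarity score are illustrative assumptions, not the paper's exact formula.

```python
import networkx as nx

def fidelity_score(kg_pred: nx.DiGraph, kg_ref: nx.DiGraph) -> float:
    """Map the graph-edit distance between two KGs to a [0, 1] fidelity score."""
    ged = nx.graph_edit_distance(
        kg_pred,
        kg_ref,
        node_match=lambda a, b: a.get("label") == b.get("label"),
        edge_match=lambda a, b: a.get("relation") == b.get("relation"),
    )
    # Normalize by an upper bound on GED: deleting every node and edge of one
    # graph and inserting every node and edge of the other (unit costs).
    worst = (kg_pred.number_of_nodes() + kg_pred.number_of_edges()
             + kg_ref.number_of_nodes() + kg_ref.number_of_edges())
    return 1.0 - ged / worst if worst else 1.0

def mmmg_like_score(kg_pred: nx.DiGraph, kg_ref: nx.DiGraph, clarity: float) -> float:
    """Combine factual fidelity with a visual-clarity score (both in [0, 1]).

    The multiplicative combination and the 0-100 scale are assumptions made
    for illustration, chosen to match the score ranges reported above.
    """
    return 100.0 * fidelity_score(kg_pred, kg_ref) * clarity
```

Note that exact GED is exponential in graph size, so a real evaluation pipeline would typically bound the search (e.g., via `networkx`'s `timeout` argument to `graph_edit_distance`) or use an approximation.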
Croissant File: json
Dataset URL: https://huggingface.co/datasets/MMMGBench/MMMG
Code URL: https://github.com/MMMGBench/MMMG
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in computer vision
Flagged For Ethics Review: true
Submission Number: 1263
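For reference, a minimal sketch of loading the benchmark from the Dataset URL above, assuming the standard Hugging Face `datasets` API; split and column names are not confirmed here, so the snippet only inspects whatever the repository provides.

```python
from datasets import load_dataset

# Download MMMG from the Hugging Face Hub (see the Dataset URL above).
ds = load_dataset("MMMGBench/MMMG")

# Inspect the splits and columns shipped with the repository.
print(ds)
```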