Keywords: Text-to-image models, foundation model, benchmark
Abstract: We present MathViz-Bench, a comprehensive benchmark for evaluating Text-to-Image (T2I) models' capability to visualize step-by-step solutions for high school mathematics problems.
MathViz-Bench comprises 500 carefully curated problems sampled from levels 1-3 of the MATH dataset, spanning seven mathematical domains: Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus.
We transform these problems into prompts requiring models to generate visual step-by-step solutions with proper mathematical notation and logical flow.
Our automated assessment pipeline employs three metrics, each scored on a 0-5 scale: Sequential Consistency for logical flow, Symbol Fidelity for notation accuracy, and Mathematical Correctness for calculation validity (a sketch of one possible score aggregation follows the abstract).
Our evaluation shows that models with built-in language understanding (GPT-Image-1: 84.05%, Gemini-2.5-Pro: 75.13%) achieve 2-3 times higher overall scores than diffusion models (FLUX1.1-Pro: 35.23%, WAN2.2: 28.05%, Stable Diffusion 3.5 Ultra: 22.05%).
Across all models, Symbol Fidelity (1.81-4.57) consistently exceeds Mathematical Correctness (0.65-4.05), indicating that models render mathematical symbols as visual patterns rather than processing them as semantic operators.
Diffusion models' scores are invariant to problem difficulty, and they exhibit a 33.8% critical failure rate, confirming the absence of genuine mathematical reasoning.
These findings establish that mathematical visualization requires architectural integration of symbolic reasoning with visual generation, a capability that lies beyond current T2I models.
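The abstract does not specify how the per-metric 0-5 scores map to the overall percentages reported above; the following is a minimal Python sketch, assuming an unweighted mean of the three metrics normalized to 0-100%. The function name, the equal weighting, and the example scores are hypothetical, not taken from the paper.

```python
# Minimal sketch of how MathViz-Bench's three 0-5 metrics might be
# aggregated into an overall percentage. The mean-then-normalize
# scheme and equal weighting are assumptions, not the paper's
# documented scoring rule.

from statistics import mean

MAX_SCORE = 5.0  # each metric is scored on a 0-5 scale


def overall_percentage(sequential_consistency: float,
                       symbol_fidelity: float,
                       mathematical_correctness: float) -> float:
    """Average the three metric scores and normalize to 0-100%."""
    avg = mean([sequential_consistency,
                symbol_fidelity,
                mathematical_correctness])
    return 100.0 * avg / MAX_SCORE


# Hypothetical example: a model scoring 4.20, 4.57, and 3.84 on the
# three metrics would receive an overall score of 84.07%.
print(f"{overall_percentage(4.20, 4.57, 3.84):.2f}%")
```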
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 2732