Keywords: Text-to-image models, foundation model, benchmark
Abstract: We present MathViz-Bench, a comprehensive benchmark for evaluating Text-to-Image (T2I) models' capability to visualize step-by-step solutions for high school mathematics problems.
MathViz-Bench comprises 500 carefully curated problems sampled from levels 1-3 of the MATH dataset, spanning seven mathematical domains: Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus.
We transform these problems into prompts requiring models to generate visual step-by-step solutions with proper mathematical notation and logical flow.
Our automated assessment pipeline employs three metrics, each scored on a 0-5 scale: Sequential Consistency for logical flow, Symbol Fidelity for notation accuracy, and Mathematical Correctness for calculation validity (a sketch of one possible score aggregation follows the abstract).
Our evaluation shows that models with built-in language understanding (GPT-Image-1: 84.05%, Gemini-2.5-Pro: 75.13%) achieve 2-3 times higher overall scores than diffusion models (FLUX1.1-Pro: 35.23%, WAN2.2: 28.05%, Stable Diffusion 3.5 Ultra: 22.05%).
Across all models, Symbol Fidelity (1.81-4.57) consistently exceeds Mathematical Correctness (0.65-4.05), indicating that models render mathematical symbols as visual patterns rather than processing them as semantic operators.
Diffusion models' scores are invariant to problem difficulty, and they exhibit a 33.8% critical failure rate, confirming the absence of genuine mathematical reasoning.
These findings establish that mathematical visualization requires architectural integration of symbolic reasoning with visual generation, a capability that lies beyond current T2I models.
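The abstract does not specify how the per-metric 0-5 scores map to the overall percentages reported above; the following is a minimal Python sketch, assuming an unweighted mean of the three metrics normalized to 0-100%. The function name, the equal weighting, and the example scores are hypothetical, not taken from the paper.

```python
# Minimal sketch of how MathViz-Bench's three 0-5 metrics might be
# aggregated into an overall percentage. The mean-then-normalize
# scheme and equal weighting are assumptions, not the paper's
# documented scoring rule.

from statistics import mean

MAX_SCORE = 5.0  # each metric is scored on a 0-5 scale


def overall_percentage(sequential_consistency: float,
                       symbol_fidelity: float,
                       mathematical_correctness: float) -> float:
    """Average the three metric scores and normalize to 0-100%."""
    avg = mean([sequential_consistency,
                symbol_fidelity,
                mathematical_correctness])
    return 100.0 * avg / MAX_SCORE


# Hypothetical example: a model scoring 4.20, 4.57, and 3.84 on the
# three metrics would receive an overall score of 84.07%.
print(f"{overall_percentage(4.20, 4.57, 3.84):.2f}%")
```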
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 2732