Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education
Keywords: Equation-to-Visual Generation, Text-to-Image Evaluation, Multimodal Benchmarks, Visual Representations for Learning, Early Arithmetic Education
Abstract: Visual representations are highly effective in early arithmetic education, as they make abstract mathematical symbols more concrete and support the development of numeracy and reasoning skills. However, creating such visuals is labor-intensive for teachers. In this work, we introduce the equation-to-visual generation task and E2V-Bench, a benchmark for generating pedagogically meaningful visuals from arithmetic equations. Developed with insights from primary school math teachers and informed by visual patterns extracted from six educational resources, E2V-Bench comprises 1.5K arithmetic problems spanning four visual types. We also propose new automatic metrics for evaluating generated visuals. A systematic evaluation on E2V-Bench reveals that open-source text-to-image models perform substantially worse than the strongest closed-source models. Building on these findings, we curate a high-quality training dataset and demonstrate that our model adaptation strategies, including rejection sampling fine-tuning, prompt refinement, and regeneration, significantly improve model performance. This work establishes a foundation for studying equation-to-visual generation and facilitates automated tools that support teachers in creating visuals for arithmetic education.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, cross-modal application, multimodality
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2071