GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
Keywords: geometry construction, benchmark, evaluation, executable grounding, structural hallucination, multimodal agents
Abstract: We introduce **GeoBuildBench**, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats diagram generation as an interactive construction task: given a textual problem, an agent must produce a domain-specific language (DSL) program that yields a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark comprises 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, omit required objects, and fail to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are released at https://anonymous.4open.science/r/GeoBuildBench-3032.
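The evaluation protocol the abstract describes (generate a DSL program, execute it, verify the required objects and constraints, and feed the results back to the agent for a bounded number of rounds) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the released harness: every name in it (`Agent.generate_dsl_program`, `execute_dsl`, `check_constraints`, `Feedback`) is a hypothetical placeholder.

```python
# Minimal sketch of a bounded construct-execute-verify loop.
# All interfaces below are hypothetical placeholders, not the benchmark's API.
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Feedback:
    """Constraint-based feedback returned to the agent after one round."""
    missing_objects: list = field(default_factory=list)       # required objects absent from the construction
    violated_constraints: list = field(default_factory=list)  # verifiable constraints the diagram fails

    @property
    def success(self) -> bool:
        return not self.missing_objects and not self.violated_constraints


class Agent(Protocol):
    def generate_dsl_program(self, problem: str, feedback: Optional[Feedback]) -> str:
        """Propose a DSL program, optionally conditioning on prior feedback."""
        ...


def execute_dsl(program: str):
    """Placeholder: run the DSL program and return the resulting construction."""
    raise NotImplementedError


def check_constraints(construction) -> Feedback:
    """Placeholder: verify required objects and geometric constraints."""
    raise NotImplementedError


def evaluate_problem(agent: Agent, problem: str, max_rounds: int = 3) -> bool:
    """Run one problem through the bounded iterative setting; True on success."""
    feedback: Optional[Feedback] = None
    for _ in range(max_rounds):
        program = agent.generate_dsl_program(problem, feedback)
        construction = execute_dsl(program)
        feedback = check_constraints(construction)
        if feedback.success:
            return True
    return False
```

The bounded round count mirrors the paper's iterative setting, where an agent receives visual and constraint-based feedback and may revise its program a fixed number of times.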
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 9491