GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
Keywords: geometry construction, benchmark, evaluation, executable grounding, structural hallucination, multimodal agents
Abstract: We introduce **GeoBuildBench**, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats diagram generation as an interactive construction task: given a textual problem, an agent must produce a domain-specific language (DSL) program that yields a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark comprises 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, omit required objects, and fail to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are released at https://anonymous.4open.science/r/GeoBuildBench-3032.
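The evaluation protocol the abstract describes (generate a DSL program, execute it, verify the required objects and constraints, and feed the results back to the agent for a bounded number of rounds) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the released harness: every name in it (`Agent.generate_dsl_program`, `execute_dsl`, `check_constraints`, `Feedback`) is a hypothetical placeholder.

```python
# Minimal sketch of a bounded construct-execute-verify loop.
# All interfaces below are hypothetical placeholders, not the benchmark's API.
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Feedback:
    """Constraint-based feedback returned to the agent after one round."""
    missing_objects: list = field(default_factory=list)       # required objects absent from the construction
    violated_constraints: list = field(default_factory=list)  # verifiable constraints the diagram fails

    @property
    def success(self) -> bool:
        return not self.missing_objects and not self.violated_constraints


class Agent(Protocol):
    def generate_dsl_program(self, problem: str, feedback: Optional[Feedback]) -> str:
        """Propose a DSL program, optionally conditioning on prior feedback."""
        ...


def execute_dsl(program: str):
    """Placeholder: run the DSL program and return the resulting construction."""
    raise NotImplementedError


def check_constraints(construction) -> Feedback:
    """Placeholder: verify required objects and geometric constraints."""
    raise NotImplementedError


def evaluate_problem(agent: Agent, problem: str, max_rounds: int = 3) -> bool:
    """Run one problem through the bounded iterative setting; True on success."""
    feedback: Optional[Feedback] = None
    for _ in range(max_rounds):
        program = agent.generate_dsl_program(problem, feedback)
        construction = execute_dsl(program)
        feedback = check_constraints(construction)
        if feedback.success:
            return True
    return False
```

The bounded round count mirrors the paper's iterative setting, where an agent receives visual and constraint-based feedback and may revise its program a fixed number of times.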
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 9491