GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone
Abstract: Large language models have shown impressive results for multi-hop mathematical reasoning when the input question is only textual. Many mathematical reasoning problems, however, contain both text and images. With the ever-increasing adoption of vision language models (VLMs), understanding their reasoning abilities for such problems is crucial. In this paper, we evaluate the reasoning capabilities of VLMs along various axes through the lens of geometry problems. We procedurally create a synthetic dataset of geometry questions with controllable difficulty levels along multiple axes, thus enabling a systematic evaluation. The empirical results obtained using our benchmark for state-of-the-art VLMs indicate that these models are not as capable in subjects like geometry (and, by generalization, other topics requiring similar reasoning) as suggested by previous benchmarks. This is made especially clear by the construction of our benchmark at various depth levels, since solving higher-depth problems requires long chains of reasoning rather than additional memorized knowledge.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Preprint Status: There is a non-anonymous preprint (URL specified in the next question).
A1: yes
A1 Elaboration For Yes Or No: The limitations are discussed right after the conclusion.
A2: yes
A2 Elaboration For Yes Or No: The limitations are discussed right after the conclusion.
A3: yes
A3 Elaboration For Yes Or No: The abstract provides a succinct description of the main findings and the last paragraph of the introduction lists the findings in detail.
B: yes
B1: yes
B1 Elaboration For Yes Or No: We created a new dataset. We used existing LLMs/VLMs and cited them accordingly.
B2: yes
B2 Elaboration For Yes Or No: The dataset will be made publicly available under a CC BY license.
B3: yes
B3 Elaboration For Yes Or No: We used models for generating solutions to geometry problems and then evaluated them.
B4: n/a
B5: n/a
B6: yes
B6 Elaboration For Yes Or No: This is provided in the main body and the implementation details section in the appendix.
C: yes
C1: yes
C1 Elaboration For Yes Or No: Only when possible (the number of parameters for some models is unknown).
C2: yes
C2 Elaboration For Yes Or No: See implementation details.
C3: no
C3 Elaboration For Yes Or No: Due to high cost, experiments were run only once.
C4: n/a
D: no
D1: n/a
D2: n/a
D3: n/a
D4: n/a
D5: n/a
E: no
E1: n/a