Seeing and Solving: An Interpreter-Solver Framework for Geometric Reasoning with Large Vision and Language Models
Abstract: Geometrical Problem Solving (GPS), which requires interpreting diagrams and text and then applying logical reasoning and mathematical principles, has gained significant attention with the advancement of Multimodal Large Language Models (MLLMs). However, solving these problems in a zero-shot setting has received comparatively little attention, despite steady improvements in AI reasoning for visual mathematics. In this study, we propose Interpreter-Solver, a two-stage pipeline that combines a Vision Language Model (VLM) with a Large Language Model (LLM) to address this gap. The VLM uses its visual understanding to extract formal textual descriptions of the geometric relationships in a diagram, and the LLM then reasons over these descriptions to solve the problem. The entire process relies on zero-shot prompting. Without any fine-tuning, the pipeline sets a new state of the art, achieving accuracies of 83.19% on the Geometry3K dataset and 69.67% on the MathVerse dataset. It surpasses leading methods such as InterGPS, GeoDRL, and AutoGPS while requiring 5x and 2.8x fewer parameters, respectively, than the top models on these benchmarks. https://anonymous.4open.science/r/Interpreter-Solver/
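The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `interpret_and_solve`, `Interpreter`, and `Solver`, as well as the prompt wording, are hypothetical, and the models are passed in as plain callables so that no particular model API is assumed.

```python
from typing import Callable

# Hypothetical model interfaces (assumptions, not from the paper):
# any callables with these shapes can be plugged in.
Interpreter = Callable[[bytes, str], str]  # (diagram image, prompt) -> formal description
Solver = Callable[[str], str]              # (prompt) -> answer text

# Illustrative zero-shot instruction; the paper's actual prompt is not shown here.
INTERPRETER_PROMPT = (
    "Describe every geometric relationship in the diagram as formal text."
)


def interpret_and_solve(
    image: bytes,
    question: str,
    interpreter: Interpreter,
    solver: Solver,
) -> str:
    """Run a two-stage, zero-shot interpreter-then-solver pipeline."""
    # Stage 1: the VLM converts the diagram into a formal textual description.
    description = interpreter(image, INTERPRETER_PROMPT)

    # Stage 2: the LLM reasons over text only; it never sees the image.
    solver_prompt = (
        f"Geometric relationships:\n{description}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return solver(solver_prompt).strip()
```

Keeping the two stages behind separate callables means either model can be swapped without touching the other, which is the property the abstract relies on when pairing a VLM with a separate LLM.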
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: math QA, mathematical NLP, LLM/AI agents, zero/few-shot extraction, multimodal QA, logical reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 825