Towards Geometry Problems Solving Employing GPT-4 Vision with Few-Shot Prompting: An Empirical Study of What Matters

26 Sept 2024 (modified: 13 Dec 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: Large Language Models, Mathematical Reasoning, Geometry Problem Solving, Prompting Methods
Abstract:

A few demonstrations ("few-shot prompting") can significantly improve the ability of Large Language Models (LLMs) in mathematical reasoning, including geometry problem solving (GPS). GPT-4 Vision (GPT-4V), a leading example of such models, also shows significant improvements. These gains are mainly attributed to prompting methods such as "Chain-of-Thought" and "Program-of-Thought," which combine the model's in-context learning ability with few-shot prompting to solve new problems. Despite the success of these prompting methods, it remains poorly understood what GPT-4V learns from the demonstrations that leads to improved performance. In this paper, we evaluate the answer accuracy of GPT-4V with 2-shot prompting on five geometry problem datasets and conduct a series of detailed analyses. First, through ablation experiments with valid and invalid demonstration examples, we find that the model's performance improvement is due not to the quality of the demonstrations but to their input format, output format, and logic and structure. Second, by analyzing the reasoning and computational requirements of geometry problems and checking them against experimental results, we find that GPS tasks depend more on reasoning ability than on computational power. Finally, our analysis of various prompting methods reveals that existing approaches do not effectively improve model performance with respect to problem length and geometric shape. Specialized prompting methods could therefore be designed to improve performance in these respects, or the model could be fine-tuned on additional problem data with longer problems or mixed geometric shapes. Overall, developing an LLM that is fully adapted to GPS tasks is a key research direction. The source code and data will be made available in a GitHub repository.
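As a rough illustration of the 2-shot setup and the valid/invalid demonstration ablation described in the abstract, the following is a minimal Python sketch. It is not the authors' code: the `Demo` type, the `Question:/Reasoning:/Answer:` template, the `corrupt` helper (which builds an invalid demonstration by swapping in a wrong final answer), and exact-match scoring are all assumptions made for illustration. The abstract does not specify how invalid demonstrations are constructed or how answers are scored, and diagram images and the model call are omitted here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One demonstration example (hypothetical structure)."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_prompt(demos: List[Demo], question: str) -> str:
    """Concatenate the demonstrations, then the new question, in one fixed format."""
    parts = [
        f"Question: {d.question}\nReasoning: {d.reasoning}\nAnswer: {d.answer}"
        for d in demos
    ]
    # The new problem reuses the same template and leaves the reasoning open.
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


def corrupt(demo: Demo, wrong_answer: str) -> Demo:
    """Assumed form of an 'invalid' demonstration: same format, wrong final answer."""
    return Demo(demo.question, demo.reasoning, wrong_answer)


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Exact-match answer accuracy after trimming whitespace."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Under these assumptions, the ablation would compare accuracy when the prompt is built from the original demonstrations against accuracy when it is built from their `corrupt`-ed counterparts, with the format held fixed.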

Primary Area: foundation or frontier models, including LLMs
Submission Number: 6698