How well does GPT-4o understand vision? Solving standard computer vision tasks with multimodal foundation models
Keywords: multimodal foundation models, computer vision
TL;DR: We develop prompt chaining techniques that let multimodal foundation models (e.g., GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, Qwen2-VL) solve standard computer vision tasks such as semantic segmentation and depth estimation.
Abstract: Multimodal foundation models, such as GPT-4o, have made remarkable progress recently. However, it is not clear exactly where these models stand in terms of understanding vision. In this paper, we \textbf{quantify the performance of popular multimodal foundation models} (GPT-4o, Gemini Pro, Claude 3.5 Sonnet, Qwen2-VL) \textbf{at standard computer vision tasks} (semantic segmentation, object detection, image classification, depth and surface normal prediction) \textbf{using established datasets} (e.g., COCO, ImageNet and its variants).
The main challenges in performing this evaluation are: \textbf{1)} the models are trained to output text and cannot natively express output in versatile domains, such as segments or 3D geometry, and \textbf{2)} many of the leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining.
We observe that \textbf{1)} the models are not close to the state-of-the-art at any task, and \textbf{2)} they perform semantic tasks notably better than geometric ones. However, \textbf{3)} they are respectable generalists; this is remarkable as they are presumably trained primarily on image- and text-based tasks. \textbf{4)} While the prompting techniques affect performance, better models exhibit less sensitivity to prompt variations. \textbf{5)} GPT-4o performs the best, ranking first in 5 out of 6 tasks.
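To make the prompt-chaining idea concrete, the following is a minimal sketch of what one such chain could look like for image classification, assuming the official OpenAI Python SDK and a GPT-4o endpoint; the two-step coarse-to-fine chain, the label hierarchy, and the prompt wording are illustrative placeholders rather than the chains actually used in the paper.

```python
import base64
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"


def ask(image_url: str, question: str) -> str:
    """Send one image + text prompt to GPT-4o and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()


def classify(image_path: str, superclasses: dict[str, list[str]]) -> str:
    """Two-step prompt chain: pick a coarse superclass, then a fine label within it."""
    image_url = encode_image(image_path)

    # Step 1: narrow the answer space to a coarse superclass.
    coarse = ask(
        image_url,
        "Which of these categories best describes the main object? "
        f"Answer with exactly one word from: {', '.join(superclasses)}.",
    )

    # Step 2: choose the fine-grained label within that superclass
    # (fall back to all labels if the first reply is not a known key).
    candidates = superclasses.get(coarse, sum(superclasses.values(), []))
    return ask(
        image_url,
        "Which of these labels best describes the main object? "
        f"Answer with exactly one label from: {', '.join(candidates)}.",
    )


# Example usage with a tiny hypothetical label hierarchy:
# label = classify("example.jpg", {"animal": ["tabby cat", "golden retriever"],
#                                  "vehicle": ["sports car", "pickup truck"]})
```

Restricting each step to a fixed list of candidate answers keeps the text output machine-parseable, which is the property that makes the task API-compatible in the first place; geometric tasks such as depth or surface normal prediction would require longer chains that reduce dense prediction to a sequence of similarly constrained text queries.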
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6722