Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023. Readers: Everyone
Keywords: general-purpose vision, benchmark, visual representation
TL;DR: We propose a comprehensive benchmark for holistic evaluation of general-purpose visual representations, along with a general framework that bridges the gaps among visual tasks and accommodates arbitrary representations
Abstract: Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts toward general vision models are limited to a narrow range of tasks and offer no overarching framework for performing visual tasks holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities across four disjoint functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, ranging from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework that accommodates arbitrary visual representations on all 11 tasks. Using our benchmark and framework, we evaluate 7 typical visual representations and observe that (1) Transformer architectures and larger pre-training data empirically lead to more general-purpose representations, (2) language plays a significant role in learning versatile visual representations, and (3) cross-task correlations reveal a subtle shared structure among the tasks despite their distinctions, which could be evidence of general-purpose capability. We argue that instead of pursuing general-purpose vision models through end-to-end multi-task training, it is more reasonable to evaluate and investigate representations; this helps digest emerging pre-trained vision models and hopefully sheds light on general intelligence.
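To make the described encoder-decoder framework concrete, here is a minimal sketch of how an arbitrary pretrained visual backbone could be paired with lightweight task-specific decoder heads, in the spirit of what the abstract describes. This is not the authors' released code; the class name, task names, and head shapes below are hypothetical placeholders chosen only for illustration.

```python
# Minimal sketch (not the authors' implementation): a shared visual encoder
# produces a generic representation, and per-task decoder heads map it to each
# task's output space. All names here are illustrative assumptions.
import torch
import torch.nn as nn


class EncoderDecoderProbe(nn.Module):
    def __init__(self, encoder: nn.Module, task_heads: dict):
        super().__init__()
        self.encoder = encoder                        # arbitrary pretrained visual backbone
        self.task_heads = nn.ModuleDict(task_heads)   # one lightweight decoder per task

    def forward(self, images: torch.Tensor, task: str) -> torch.Tensor:
        features = self.encoder(images)               # shared general-purpose representation
        return self.task_heads[task](features)        # task-specific decoding


# Toy usage: a tiny convolutional encoder with two illustrative heads
# (a classification-style head and a pose-regression-style head).
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model = EncoderDecoderProbe(
    encoder,
    task_heads={
        "vqa": nn.Linear(64, 1000),         # e.g., answer classification
        "manipulation": nn.Linear(64, 7),   # e.g., end-effector pose
    },
)
out = model(torch.randn(2, 3, 224, 224), task="vqa")
```

In such a setup, swapping the `encoder` for a different pretrained model while keeping the decoders fixed is what allows different visual representations to be compared on the same set of tasks.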
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Infrastructure (eg, datasets, competitions, implementations, libraries)