ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Published: 17 Sept 2022, Last Modified: 23 May 2023
NeurIPS 2022 Datasets and Benchmarks
Keywords: evaluation platform, task-level transfer, language-image pre-training, image classification, object detection
TL;DR: ELEVATER provides the first public platform and toolkit for evaluating vision foundation models on large-scale task-level visual transfer, covering 20 image classification tasks and 35 object detection tasks.
Abstract: Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets/tasks. However, it remains challenging to evaluate the transferability of these foundation models due to the lack of easy-to-use toolkits for fair benchmarking. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark to compare and evaluate pre-trained language-augmented visual models. Several highlights include: (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to ensure fairness in model adaptation. To leverage the full power of language-augmented visual models, novel language-aware initialization methods are proposed to significantly improve the adaptation performance. (iii) Metrics. A variety of evaluation metrics are used, including sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). We will publicly release ELEVATER.
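To illustrate the language-aware initialization idea mentioned in the abstract, here is a minimal sketch (not the official ELEVATER toolkit API): the linear classifier head is initialized from the text encoder's embeddings of the class names, so adaptation starts from the zero-shot solution rather than from random weights. The function and prompt template below are illustrative assumptions for a CLIP-style model.

```python
import torch
import torch.nn.functional as F

def language_aware_linear_head(text_encoder, class_names, feat_dim):
    """Build a linear head whose rows are normalized class-name text embeddings.

    Sketch only: `text_encoder` is assumed to map a list of strings to a
    (num_classes, feat_dim) tensor; the prompt template is a common choice,
    not necessarily the one used in the paper.
    """
    with torch.no_grad():
        prompts = [f"a photo of a {name}" for name in class_names]
        text_feats = text_encoder(prompts)          # (num_classes, feat_dim)
        text_feats = F.normalize(text_feats, dim=-1)
    head = torch.nn.Linear(feat_dim, len(class_names), bias=False)
    head.weight.data.copy_(text_feats)              # language-aware initialization
    return head
```

The head can then be used directly for zero-shot prediction or further trained for linear probing / full fine-tuning, which is how sample- and parameter-efficiency are compared in the benchmark.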
Author Statement: Yes
URL: https://computer-vision-in-the-wild.github.io/ELEVATER/
Dataset Url: https://github.com/Computer-Vision-in-the-Wild/DataDownload
License: Our code is under MIT license. The data of our benchmark includes two parts: (1) A suite of public datasets. Please follow the original license of each dataset. (2) External knowledge. The knowledge extracted from WordNet/Wiktionary follows their own licenses. For the knowledge extracted from GPT3, we have the approval from OpenAI to use it for this research benchmark to encourage future studies.
Supplementary Material: zip
Contribution Process Agreement: Yes
In Person Attendance: No