Keywords: Out-of-Distribution, MultiModal Large Language Model, Benchmark, Dataset
TL;DR: Benchmark and Dataset
Abstract: Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive amounts of data under the assumption that the data are independent and identically distributed (IID). In real-world scenarios, however, it is impractical to expect that all data processed by an AI system will satisfy this assumption. Furthermore, if an AI system ignores out-of-distribution (OOD) objects during processing, it may create safety hazards or even lead to catastrophic consequences in real-world applications (e.g., autonomous driving and medical assistance). Unfortunately, current research has not yet provided valid benchmarks that comprehensively assess how VLMs perform on OOD data. We therefore propose OODBench, a predominantly automated method with minimal human verification for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that VLMs still struggle with the natural image categories in OODBench even though those categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess more fully the impact of OOD data on questions of varying difficulty. Lastly, we summarize substantial findings and insights to facilitate future research on the acquisition and evaluation of OOD data. The dataset is open to the public for research.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 9193