Keywords: Item-Response Theory, Evaluation, Accessibility, Rating, Comprehension, Rasch Model, Wright Map
TL;DR: Evaluation of Vision Language Models with Item Response Theory
Abstract: Evaluating generative AI output is difficult because of the high-dimensional nature of the problem space. Accuracy-oriented benchmarks are often used to assess output quality, but they may give an incomplete picture because they do not account for the difficulty of the items that make up a task. We present the use of Item Response Theory (IRT) to evaluate the outputs of a cohort of Vision Language Models (VLMs) on two tasks: image caption rating and visual reading comprehension. We show how IRT surfaces meaningful, interpretable differences between popular state-of-the-art VLMs. IRT can be applied at many points in the ML workflow; we review prior work and outline ways IRT can be incorporated into your own research. Our aim is to encourage the ML community to adopt IRT as a general tool for evaluation.
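The sketch below illustrates the core idea behind the Rasch model named in the keywords: each response is modeled as P(correct) = sigmoid(theta_m - b_i), where theta_m is a model's ability and b_i is an item's difficulty, so models and items land on a common logit scale. This is a minimal, self-contained illustration assuming a hypothetical binary response matrix, not the paper's implementation; in practice, established IRT packages (e.g., py-irt in Python or eRm in R) would be used.

```python
import numpy as np

# Hypothetical data: responses[m, i] = 1 if VLM m answered item i correctly.
rng = np.random.default_rng(0)
n_models, n_items = 5, 40
true_theta = rng.normal(0.0, 1.0, n_models)   # latent model "ability"
true_b = rng.normal(0.0, 1.0, n_items)        # latent item difficulty
p_true = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = rng.binomial(1, p_true)

# Joint maximum-likelihood fit of the Rasch (1PL) model by gradient ascent
# on the Bernoulli log-likelihood.
theta = np.zeros(n_models)
b = np.zeros(n_items)
lr = 0.01
for _ in range(3000):
    p_hat = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    resid = responses - p_hat        # gradient of log-likelihood w.r.t. (theta - b)
    theta += lr * resid.sum(axis=1)  # ability rises when a model beats expectation
    b -= lr * resid.sum(axis=0)      # difficulty rises when items are missed
    b -= b.mean()                    # identification: anchor mean difficulty at 0

print("estimated model abilities:", np.round(theta, 2))
print("estimated item difficulties (first 8):", np.round(b[:8], 2))
```

Because abilities and difficulties share one scale, plotting the two sets of estimates side by side yields a Wright map: a model placed above an item is expected to answer it correctly more often than not.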
Primary Area: interpretability and explainable AI
Submission Number: 23048