AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

ICLR 2026 Conference Submission 15738 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Efficient Benchmarking; Multimodal Large Language Models; Agent
TL;DR: We employ an MLLM-based agent as a judger to select subsets from existing benchmarks for efficient evaluation of MLLMs.
Abstract: Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring effort. To address this escalating cost, we introduce ***AutoJudger***, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model’s real-time performance. Specifically, AutoJudger incorporates two pivotal components: *a semantic-aware retrieval mechanism* that ensures the selected questions cover diverse and challenging scenarios across both vision and language modalities, and *a dynamic memory* that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses: on MMT-Bench, AutoJudger uses only 4\% of the data and 10\% of the computational cost (for evaluating a 7B model) to achieve over 90\% ranking accuracy relative to full-benchmark evaluation.
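The abstract describes IRT-based difficulty estimation and adaptive question selection at a high level. The sketch below illustrates how such a selection loop can work in general; it is not the paper's implementation. The two-parameter logistic (2PL) IRT model, the grid-based ability estimate, Fisher-information selection, and the names `p_correct`, `estimate_ability`, and `select_next_item` are all illustrative assumptions, and the sketch omits AutoJudger's semantic-aware retrieval and dynamic memory components.

```python
import numpy as np

def p_correct(theta, a, b):
    # 2PL IRT: probability that a model with ability theta answers
    # an item with discrimination a and difficulty b correctly.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, grid=np.linspace(-4, 4, 401)):
    # Maximum-likelihood ability estimate over a grid of theta values.
    # responses: 0/1 outcomes on items already administered;
    # a, b: IRT parameters of those same items.
    probs = p_correct(grid[:, None], a[None, :], b[None, :])
    log_lik = (responses * np.log(probs)
               + (1 - responses) * np.log(1 - probs)).sum(axis=1)
    return grid[np.argmax(log_lik)]

def select_next_item(theta, a, b, asked):
    # Pick the unasked item with maximal Fisher information at theta;
    # under the 2PL model, I(theta) = a^2 * p * (1 - p).
    p = p_correct(theta, a, b)
    info = a ** 2 * p * (1 - p)
    info[list(asked)] = -np.inf  # exclude items already administered
    return int(np.argmax(info))

# Toy adaptive-evaluation loop over a hypothetical bank of 100 items.
rng = np.random.default_rng(0)
a_bank = rng.uniform(0.5, 2.0, 100)  # discrimination parameters
b_bank = rng.normal(0.0, 1.0, 100)   # difficulty parameters
asked, responses, theta = [], [], 0.0
for _ in range(10):
    item = select_next_item(theta, a_bank, b_bank, asked)
    asked.append(item)
    # Simulate a model whose true ability is 0.5 answering the item.
    responses.append(int(rng.random() < p_correct(0.5, a_bank[item], b_bank[item])))
    theta = estimate_ability(np.array(responses), a_bank[asked], b_bank[asked])
```

Selecting by Fisher information concentrates questions near the model's current ability estimate, which is why adaptive schemes of this kind can rank models with a small fraction of a benchmark.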
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15738