Keywords: Capability Evaluation, Automated Evaluation, Evaluation
Abstract: Current evaluation frameworks for foundation models rely on fixed, manually curated benchmarks, limiting coverage of model capabilities. We propose Active learning for Capability Evaluation, a scalable framework for automated, fine-grained evaluation. Our framework leverages language models to decompose domains into semantically meaningful capabilities and to generate diverse tasks, reducing human effort. It models a subject model’s performance as a capability function over a latent semantic space and applies active learning to prioritize the most informative evaluations. This adaptive strategy enables cost-efficient discovery of strengths, weaknesses, and failure modes that static benchmarks may overlook. Our results show that this adaptive evaluation yields a more complete picture of model capabilities than static benchmarks alone.
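To make the active-learning loop in the abstract concrete, below is a minimal illustrative sketch: it treats the subject model's performance as an unknown capability function over task embeddings and selects the next task to evaluate by predictive uncertainty. The Gaussian-process surrogate, the embedding dimensionality, and the uncertainty-based acquisition rule are assumptions for illustration, not details stated in the abstract.

```python
import numpy as np

# Illustrative sketch only: the GP surrogate, 8-dim embeddings, and
# max-variance acquisition are assumptions, not the paper's stated method.

rng = np.random.default_rng(0)

# Embeddings of candidate tasks in a latent semantic space (assumed 8-dim).
candidate_tasks = rng.normal(size=(200, 8))

# Tasks already evaluated, with observed scores of the subject model in [0, 1].
evaluated_tasks = rng.normal(size=(10, 8))
observed_scores = rng.uniform(size=10)

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of task embeddings."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-2):
    """Posterior mean and variance of a GP capability function at query tasks."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    k_qq = rbf_kernel(x_query, x_query)
    solve = np.linalg.solve(k_tt, k_tq)
    mean = solve.T @ y_train
    var = np.diag(k_qq) - np.sum(k_tq * solve, axis=0)
    return mean, var

# Active-learning step: evaluate the candidate task where the capability
# estimate is most uncertain (maximum posterior variance).
mean, var = gp_posterior(evaluated_tasks, observed_scores, candidate_tasks)
next_task = int(np.argmax(var))
print(f"Next task to evaluate: {next_task} (predicted score {mean[next_task]:.2f})")
```

In practice, the selected task would be posed to the subject model, the observed score appended to the evaluated set, and the loop repeated, so the evaluation budget concentrates on regions of the capability space where the model's behavior is least well characterized.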
Submission Number: 124