Certifiable Evaluation: A Low-Rank Framework for Foundation Model Benchmarking with Formal Performance Guarantees
Keywords: Certifiable Evaluation, Foundation Models, Benchmarking, Latent Capability Model, Low-Rank Decomposition, PAC Certificates, Optimal Experimental Design, Minimum Sufficient Evaluation Set
TL;DR: This paper introduces a low-rank Latent Capability Model that provides formal, provable performance guarantees for foundation models on unseen tasks by evaluating them on a minimal, optimally selected subset of benchmark tests.
Abstract: Benchmark scores for foundation models are point estimates that carry no formal guarantees about performance on unseen tasks. We address this gap by developing a rigorous theory of certifiable evaluation grounded in a Latent Capability Model (LCM): the $n \times m$ matrix of model--task performance scores is posited to have rank $r \ll \min(n, m)$, with each score decomposing as an inner product of a model capability vector and a task requirement vector, perturbed by sub-Gaussian noise. Within this framework we prove (i) an information-theoretic lower bound showing that any benchmark with fewer than $r$ tasks cannot produce non-trivial performance certificates for unseen tasks; (ii) an oracle PAC certificate that translates benchmark scores into prediction intervals for unseen tasks, with width scaling as $\mathcal{O}\left(\sigma \sqrt{k}/\sigma_{\min}(V_B)\right)$; (iii) a D-optimal benchmark selection criterion---the Minimum Sufficient Evaluation Set (MSES)---that maximises certificate quality for a given budget, with a greedy algorithm achieving a $(1 - 1/e)$-approximation. We validate all theoretical claims on a cross-benchmark performance matrix comprising 157 publicly available language models evaluated across a unified 86-task cross-benchmark matrix sampled from six benchmark suites whose full task counts total 376. Empirically, the effective rank is $r_{\mathrm{eff}} = 8$ despite 86 observed tasks, implying that the full benchmark suites carry substantial per-suite redundancy (82--95\%, consistent with Table 1). MSES achieves nominal certificate coverage to within 0.8 percentage points at $1 - \delta = 0.95$, while halving certificate width relative to random task selection.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40
Loading