Keywords: evaluation, efficient, benchmark
TL;DR: We propose an efficient evaluation method that measures each target model on an adaptively constructed, tailored coreset.
Abstract: Evaluating models on large benchmarks can be very resource-intensive, especially during a period of rapid model iteration. Existing efficient evaluation methods approximate the performance of target models by assessing them on a small static coreset derived from the publicly available evaluation results of source models. However, these approaches rely on the assumption that each target model has high prediction consistency with the source models, which does not hold well in practice and leads to inaccurate performance estimates. To fill this gap, we propose TailoredBench, a method that provides customized evaluations tailored to each target model. Specifically, a Global-coreset is first constructed as a probe to identify the most consistent source models for each target model via an adaptive source-model selection strategy. Afterwards, a scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model. Based on each target model's predictions on its Native-coreset, we estimate its overall performance with a calibrated restoration strategy. Comprehensive experiments on five benchmarks across more than 300 models demonstrate that, compared to the best-performing baselines, TailoredBench achieves an average reduction of 24.8% in the MAE of accuracy estimates, showcasing strong effectiveness and generalizability.
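For concreteness, below is a minimal sketch of how such a pipeline could be implemented, based only on the abstract's description. All names (`k_medoids`, `tailored_estimate`, `source_preds`, `target_model`) and parameter choices are illustrative assumptions, not the authors' code, and the calibrated restoration step is simplified here to plain cluster-size weighting.

```python
# Hypothetical sketch of the TailoredBench pipeline described in the abstract.
# Assumptions: source_preds is an (n_sources, n_examples) 0/1 correctness
# matrix from public evaluation results, and target_model(i) returns the
# target model's 0/1 correctness on benchmark example i.
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    """Plain K-Medoids on rows of X (one row per benchmark example).
    Uses a full pairwise distance matrix, so it suits modest benchmark sizes;
    the paper proposes a scalable variant, which this sketch does not claim
    to reproduce."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)  # assign to nearest medoid
        new = medoids.copy()
        for c in range(k):  # update: pick the point minimizing within-cluster cost
            cluster = np.where(labels == c)[0]
            if len(cluster):
                within = dist[np.ix_(cluster, cluster)].sum(axis=1)
                new[c] = cluster[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(dist[:, medoids], axis=1)

def tailored_estimate(source_preds, target_model, k_global=20, k_native=50,
                      n_native_sources=10):
    n_examples = source_preds.shape[1]
    # 1) Global-coreset: embed each example by its source-prediction pattern.
    probe, _ = k_medoids(source_preds.T, k_global)
    # 2) Probe the target model; keep the most consistent source models.
    target_probe = np.array([target_model(i) for i in probe])
    consistency = (source_preds[:, probe] == target_probe).mean(axis=1)
    native_sources = np.argsort(consistency)[-n_native_sources:]
    # 3) Native-coreset: re-cluster using only the selected source models.
    medoids, labels = k_medoids(source_preds[native_sources].T, k_native, seed=1)
    # 4) Restoration: weight each medoid's result by its cluster size.
    #    (The paper applies a *calibrated* restoration; omitted here.)
    target_medoids = np.array([target_model(i) for i in medoids])
    weights = np.bincount(labels, minlength=k_native) / n_examples
    return float(weights @ target_medoids)
```

The cluster-size weighting in step 4 recovers an overall-accuracy estimate under the assumption that each medoid is representative of its cluster; how the paper's calibration corrects the residual bias of this estimate is not specified in the abstract.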
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13828