Keywords: evaluation, skills, tradeoffs, frontier models, llm, large multimodal model, rationale, error analysis, interpretability
TL;DR: Automatic skill-based evaluation of LLMs and LMMs by parsing skills from rationales for existing benchmarks
Abstract: With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark, and even within a single instance, at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance by inspecting model-generated \emph{rationales}. After validating the relevance of rationale-parsed skills and inferring skills for $46$k instances over $12$ benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of \emph{skill-slices} (i.e., sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights into model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is $18\%$ more accurate at \emph{computing molar mass} but $19\%$ less accurate at \emph{applying constitutional law}, despite the overall accuracies of the three models differing by a mere $0.4\%$. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill-slice analysis generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a $3\%$ accuracy improvement across our $12$-dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
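The sketch below illustrates the two steps the abstract describes: aggregating per-skill (skill-slice) accuracy from skill-tagged instances, and routing a held-out instance to the model strongest on its required skills. It is a minimal illustration only, assuming a hypothetical record format (`skills`, `correct_by_model`) and function names not taken from the paper.

```python
from collections import defaultdict

# Hypothetical data format (not the authors' code):
# each record is {"skills": [...], "correct_by_model": {model_name: bool}}.

def skill_slice_accuracy(records, models):
    """Return {skill: {model: accuracy}} over all instances tagged with that skill."""
    totals = defaultdict(lambda: defaultdict(int))
    hits = defaultdict(lambda: defaultdict(int))
    for rec in records:
        for skill in rec["skills"]:
            for model in models:
                totals[skill][model] += 1
                hits[skill][model] += int(rec["correct_by_model"][model])
    return {
        skill: {m: hits[skill][m] / totals[skill][m] for m in models}
        for skill in totals
    }

def route_instance(skills, slice_acc, models):
    """Pick the model with the highest mean slice accuracy over the instance's skills."""
    def mean_acc(model):
        accs = [slice_acc[s][model] for s in skills if s in slice_acc]
        return sum(accs) / len(accs) if accs else 0.0
    return max(models, key=mean_acc)
```

For example, given slice accuracies computed on the training split, `route_instance(["applying constitutional law"], slice_acc, ["gpt-4o", "claude-3.5-sonnet", "gemini-1.5-pro"])` would select whichever model scores highest on that slice; the actual routing rule used in the paper may differ.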
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9244