Keywords: constrained generation, large language model, skill evaluation, LLM benchmark, emergence
Abstract: With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. This capability to combine skills plays an important role in (human) pedagogy and also in a recent paper on emergence phenomena (Arora & Goyal, 2023). A new evaluation, Skill-Mix, is introduced to measure this capability. Using a list of $N$ skills, the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, even for modest $k$ this evaluation will, with high probability, require the LLM to produce text it has not seen in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model.
Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities (including suspected cases of ``cramming for the leaderboard'') that are not captured by their ranking on popular LLM leaderboards. Our methodology can adapt flexibly to future models and model capabilities by expanding the set of skills being tested and increasing $k$. By publicly releasing the Skill-Mix methodology, we hope it may grow into an ecosystem of open evaluations for AI capabilities, including in multi-modal settings. These may serve as more trustworthy gauges of model capabilities than current leaderboards.
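The sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the skill names, the list size, the prompt wording, and the function names `num_skill_subsets`, `sample_skill_subset`, and `build_prompt` are all assumptions made for illustration. It shows only how a random subset of $k$ skills might be drawn and how quickly the number of possible subsets, $\binom{N}{k}$, grows (on the order of $N^k$ for fixed $k$).

```python
import math
import random


def num_skill_subsets(n_skills: int, k: int) -> int:
    """Number of distinct k-subsets of n_skills skills, i.e. C(N, k)."""
    return math.comb(n_skills, k)


def sample_skill_subset(skills: list[str], k: int, rng: random.Random) -> list[str]:
    """Draw k distinct skills uniformly at random (without replacement)."""
    return rng.sample(skills, k)


def build_prompt(skill_subset: list[str]) -> str:
    """Hypothetical prompt wording; the paper's actual prompt is not given here."""
    return (
        "Write a short piece of text that demonstrates all of the "
        "following skills: " + ", ".join(skill_subset) + "."
    )


if __name__ == "__main__":
    # Placeholder skill names, assumed for illustration only.
    skills = [f"skill_{i}" for i in range(100)]
    rng = random.Random(0)  # fixed seed so the draw is reproducible

    subset = sample_skill_subset(skills, k=3, rng=rng)
    print(build_prompt(subset))

    # The count of possible subsets rises steeply with k.
    for k in range(1, 6):
        print(k, num_skill_subsets(len(skills), k))
```

With an assumed list of 100 skills, $k = 3$ already gives 161,700 possible subsets, which is the intuition behind the claim that a randomly drawn combination is unlikely to appear in the training set.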
Submission Number: 74