Language Task Difficulty Prediction Through LLM-Annotated Meta-Features

Published: 01 Jan 2024 · Last Modified: 19 Feb 2025 · ECAI 2024 · CC BY-SA 4.0
Abstract: Assessing the capabilities of large language models (LLMs) is increasingly challenging due to their generality and uneven task performance. Often, we do not know how much of the success or failure on a particular task is due to the 'loading' of linguistic elements in the task, such as narrative understanding, versus other intrinsic (non-linguistic) components, such as domain-specific common sense or reasoning capabilities. Understanding which tasks are most linguistically loaded and determining the predictability of LLMs on these tasks is crucial for improving benchmarks, designing better LLMs, and ensuring their safe deployment. We present an innovative methodology that uses LLMs to annotate linguistic meta-features, allowing us to predict task difficulty and understand linguistic loadings more accurately than traditional readability scores. Using GPT-4 for automated annotation, we show strong predictability for a variety of tasks and language models (e.g., MMLU with R² from 0.68 to 0.83), but observe limited predictability for other tasks (e.g., LSAT with R² of -0.07).
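The pipeline the abstract describes can be pictured as a simple two-stage setup: an LLM assigns per-item linguistic meta-feature scores, and a regression model maps those scores to observed per-item difficulty, evaluated with R². The sketch below is illustrative only, not the authors' code; the feature names, the simulated annotations, and the difficulty target are hypothetical placeholders standing in for GPT-4 annotations and measured model accuracy.

```python
# Illustrative sketch (not the authors' pipeline): predict per-item difficulty
# from LLM-annotated linguistic meta-features and report held-out R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_items = 500

# Hypothetical meta-feature annotations (e.g., 1-5 ratings an annotator LLM
# could assign per item for vocabulary difficulty, syntactic complexity,
# narrative load, and reasoning depth). Here they are simply simulated.
meta_features = rng.uniform(1.0, 5.0, size=(n_items, 4))

# Hypothetical target: the per-item success rate of a subject LLM,
# simulated as a noisy decreasing function of the meta-features.
difficulty = 1.0 - 0.15 * meta_features.mean(axis=1) + rng.normal(0, 0.05, n_items)

X_train, X_test, y_train, y_test = train_test_split(
    meta_features, difficulty, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
print(f"Held-out R^2: {r2_score(y_test, model.predict(X_test)):.2f}")
```

In this framing, a high held-out R² (as reported for MMLU) means the annotated linguistic meta-features carry most of the signal about which items a model will get right, while a near-zero or negative R² (as for LSAT) suggests difficulty is driven by factors the linguistic annotations do not capture.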