STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs

ACL ARR 2026 January Submission2370 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Skill Gaps, Scaffolding, Benchmark Evaluation, Variations
Abstract: Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve it. To make these weaknesses visible, we propose Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions they lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model's unique and distinct skill gaps.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Human-Centered NLP, Interpretability and Analysis of Models for NLP, Language Modeling, Linguistic Theories, Cognitive Modeling, and Psycholinguistics, Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data analysis, Theory
Languages Studied: English, Natural language
Submission Number: 2370