STAT: Skill-Targeted Adaptive Training

Published: 23 Sept 2025, Last Modified: 17 Feb 2026. CogInterp @ NeurIPS 2025 Poster. CC BY 4.0
Keywords: metacognition, out-of-distribution generalization, dataset reusability, skill-level training
TL;DR: We introduce a distillation strategy, AdaptDistill, that selects or synthesizes training data targeted at a model's missing skills.
Abstract: Small language models (SLMs) often show little to no improvement when trained on data similar to that in their training set (e.g., MATH). We introduce a distillation strategy, AdaptDistill, that enables a teacher model to help such a student SLM. The teacher uses its metacognition to create a list of skills needed for the task [Didolkar et al., 2024] and to label each data point with the skills required for it. AdaptDistill constructs a missing-skill profile for the SLM by identifying which skills were absent in the model's responses and how frequently each skill was missing. We propose AdaptDistill-selected, which performs a weighted selection of training examples according to the missing-skill profile. We also propose AdaptDistill-synthetic, an analogous method in which the teacher LLM synthesizes additional examples, again targeted at the missing skills. On MATH, both methods improve Llama-Instruct models by up to 7.5% where naive fine-tuning fails, and also enhance out-of-distribution performance. These results highlight the promise of skill-aware targeted training for improving SLMs.
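The core loop described in the abstract — building a missing-skill profile from the student's responses, then sampling training examples weighted by that profile — can be sketched in Python. This is a minimal illustration under stated assumptions: the function names, the `(required_skills, demonstrated_skills)` response format, and the `(skills, example)` pool format are hypothetical, standing in for the teacher's skill annotations described in the paper.

```python
import random
from collections import Counter

def missing_skill_profile(responses):
    """Count how often each required skill is absent from the student's answers.

    `responses`: list of (required_skills, demonstrated_skills) pairs,
    a hypothetical encoding of the teacher's per-response skill labels.
    """
    profile = Counter()
    for required, demonstrated in responses:
        for skill in required:
            if skill not in demonstrated:
                profile[skill] += 1
    return profile

def select_examples(pool, profile, k, seed=0):
    """Weighted selection of training examples (AdaptDistill-selected, sketched).

    An example's weight is the summed missing-skill frequency of the skills
    it exercises, so examples covering frequently-missed skills are favored.
    `pool`: list of (skills, example) pairs labeled by the teacher.
    """
    # Small floor keeps sampling valid even if no example matches a missing skill.
    weights = [sum(profile.get(s, 0) for s in skills) + 1e-9
               for skills, _ in pool]
    rng = random.Random(seed)
    return rng.choices(pool, weights=weights, k=k)
```

A student that repeatedly misses "geometry" would yield a profile concentrated on that skill, so geometry-tagged examples dominate the selected fine-tuning set; AdaptDistill-synthetic would instead ask the teacher to generate new examples for the same high-weight skills.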
Submission Number: 87