Keywords: metacognition, out-of-distribution generalization, dataset reusability, skill-level training
TL;DR: We introduce a distillation strategy, AdaptDistill, that selects or synthesizes training data targeted at a student model's missing skills.
Abstract: Small language models (SLMs) often show little to no improvement when trained via standard SFT on data similar to what they saw in their training set (e.g., MATH). We introduce a new strategy for the teacher, AdaptDistill, that enables it to help such a student SLM. The teacher uses its metacognition to create a list of skills needed for the task and to label each data point with the skills it requires. The teacher then monitors the student's outputs and produces a missing-skill profile for the student, reflecting how often the student failed to apply each skill in its responses. This profile is used to improve teaching in one of two ways. In AdaptDistill-selected, the teacher uses a fixed set of training examples but adaptively reweights them according to the missing-skill profile. In AdaptDistill-synthetic, the teacher synthesizes additional examples on the fly involving skills on which the student is currently weak. On MATH, both methods improve Llama-Instruct models by up to 7.5%, whereas vanilla fine-tuning fails, as do prior methods for data selection + fine-tuning. Our methods also enhance out-of-distribution performance. These results highlight the promise of skill-aware targeted training for improving SLMs.
Submission Number: 136