Keywords: instruction tuning, high quality synthetic data, diverse synthetic data
TL;DR: We introduce Instruct-SkillMix, an automated approach for creating diverse, high-quality SFT data for instruction-following.
Abstract: We introduce Instruct-SkillMix, an automated approach for creating diverse, high-quality SFT data for instruction-following. The pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core "skills" for instruction-following, either by prompting the model directly or by prompting it to identify the skills needed to answer queries in existing datasets (Didolkar et al., 2024); (2) Data generation: uses the powerful LLM to generate (instruction, response) pairs that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes both diversity and difficulty.
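A minimal sketch of this two-stage pipeline is shown below (not the authors' released code): the teacher model choice, prompt wording, and helper names (query_llm, extract_skills, generate_example) are illustrative assumptions, and robustness concerns such as validating the LLM's JSON output are omitted.

```python
import json
import random
from openai import OpenAI  # any sufficiently powerful LLM API could stand in here

client = OpenAI()
MODEL = "gpt-4-turbo"  # placeholder for "an existing powerful LLM"

def query_llm(prompt: str) -> str:
    """Single-turn call to the teacher LLM."""
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def extract_skills(n_skills: int = 50) -> list[str]:
    """Stage 1: ask the LLM to name core instruction-following skills."""
    prompt = (
        f"List {n_skills} core skills a language model needs in order to follow "
        "user instructions well. Return a JSON array of short skill names."
    )
    return json.loads(query_llm(prompt))  # sketch assumes well-formed JSON output

def generate_example(skills: list[str]) -> dict:
    """Stage 2: sample a random skill pair and request an (instruction, response) pair."""
    s1, s2 = random.sample(skills, 2)  # random combination drives diversity and difficulty
    prompt = (
        f"Write a challenging user instruction that requires both '{s1}' and '{s2}', "
        "then write a high-quality response. Return a JSON object with keys "
        "'instruction' and 'response'."
    )
    return json.loads(query_llm(prompt))

if __name__ == "__main__":
    skills = extract_skills()
    dataset = [generate_example(skills) for _ in range(4000)]  # ~4K examples, as in the paper
    with open("instruct_skillmix_sft.jsonl", "w") as f:
        for ex in dataset:
            f.write(json.dumps(ex) + "\n")
```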
Vanilla SFT (i.e., no PPO, DPO, or other RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves a 42.76% length-controlled win rate on AlpacaEval 2.0, comparable to frontier models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct. The estimated cost of creating the dataset is $600.
Ablation studies also suggest plausible reasons why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult: adding 20% low-quality answers ("shirkers") to our dataset causes performance to plummet, sometimes catastrophically.
The Instruct-SkillMix pipeline is flexible and the ideas are adaptable to other settings.
Submission Number: 103