Abstract: Distilling advanced Large Language Models' instruction-following capabilities into smaller models using a selected subset has become a mainstream approach in model training. While existing synthetic instruction data selection strategies can identify valuable subsets for distillation, they predominantly rely on single-dimensional signals (e.g., reward scores, model perplexity). We argue that such narrow signals may overlook essential nuances of user instructions, especially when each instruction can be answered from multiple perspectives. We therefore investigate more diverse signals to capture comprehensive characteristics of instruction-response pairs and propose three foundational metrics that leverage Multi-LLM wisdom, informed by (1) diverse responses across multiple LLMs and (2) reward model assessment. Building on these metrics, we propose CrowdSelect, an integrated metric that combines all three with a clustering-based diversity-preservation strategy.
Our comprehensive experiments demonstrate that our foundational metrics consistently improve performance across four base models on MT-bench and Arena-Hard. CrowdSelect, as an integrated metric, achieves state-of-the-art performance under both full and LoRA fine-tuning, showing improvements of 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. We hope our findings provide valuable insights for future research in this direction.
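The combination described in the abstract, aggregating several per-instruction metric scores while preserving diversity through clustering, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the z-score averaging, and the plain k-means routine are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def select_subset(embeddings, metric_scores, k_clusters=3, per_cluster=2, seed=0):
    """Illustrative sketch (not the paper's API): combine normalized metric
    scores into one ranking signal, cluster instruction embeddings, and keep
    the top-scoring instructions within each cluster for diversity."""
    rng = np.random.default_rng(seed)
    # z-normalize each metric column, then average into a combined score
    z = (metric_scores - metric_scores.mean(0)) / (metric_scores.std(0) + 1e-8)
    combined = z.mean(axis=1)
    # plain k-means (Lloyd's algorithm) on instruction embeddings
    centers = embeddings[rng.choice(len(embeddings), k_clusters, replace=False)]
    for _ in range(20):
        dists = ((embeddings[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k_clusters):
            if (labels == c).any():
                centers[c] = embeddings[labels == c].mean(0)
    # within each cluster, keep the instructions with the highest combined score
    selected = []
    for c in range(k_clusters):
        idx = np.where(labels == c)[0]
        if idx.size:
            selected.extend(idx[np.argsort(combined[idx])[::-1][:per_cluster]])
    return sorted(int(i) for i in selected)
```

Selecting per cluster rather than globally is what prevents the subset from collapsing onto a single high-reward instruction style, which is the intuition behind the diversity-preservation step.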
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Instruction Tuning, Large Language Model, Data Selection, Model Distillation, Synthetic Data
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1426