Abstract: Distilling advanced Large Language Models' instruction-following capabilities into smaller models using a selected subset has become a mainstream approach in model training. While existing synthetic instruction data selection strategies can identify valuable subsets for distillation, they predominantly rely on single-dimensional signals (e.g., reward scores, model perplexity). We argue that such narrow signals may overlook essential nuances of user instructions, especially when each instruction can be answered from multiple perspectives. We therefore investigate more diverse signals to capture comprehensive characteristics of instruction-response pairs and propose three foundational metrics that leverage Multi-LLM wisdom, informed by (1) diverse responses across multiple LLMs and (2) reward model assessment. Building on these metrics, we propose CrowdSelect, an integrated metric that combines all three with a clustering-based diversity-preservation strategy.
Our comprehensive experiments demonstrate that our foundational metrics consistently improve performance across four base models on MT-bench and Arena-Hard. CrowdSelect, as an integrated metric, achieves state-of-the-art performance under both full and LoRA fine-tuning, showing improvements of 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. We hope our findings provide valuable insights for future research in this direction.
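The combination described in the abstract, aggregating several per-instruction metric scores while preserving diversity through clustering, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the z-score averaging, and the plain k-means routine are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def select_subset(embeddings, metric_scores, k_clusters=3, per_cluster=2, seed=0):
    """Illustrative sketch (not the paper's API): combine normalized metric
    scores into one ranking signal, cluster instruction embeddings, and keep
    the top-scoring instructions within each cluster for diversity."""
    rng = np.random.default_rng(seed)
    # z-normalize each metric column, then average into a combined score
    z = (metric_scores - metric_scores.mean(0)) / (metric_scores.std(0) + 1e-8)
    combined = z.mean(axis=1)
    # plain k-means (Lloyd's algorithm) on instruction embeddings
    centers = embeddings[rng.choice(len(embeddings), k_clusters, replace=False)]
    for _ in range(20):
        dists = ((embeddings[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k_clusters):
            if (labels == c).any():
                centers[c] = embeddings[labels == c].mean(0)
    # within each cluster, keep the instructions with the highest combined score
    selected = []
    for c in range(k_clusters):
        idx = np.where(labels == c)[0]
        if idx.size:
            selected.extend(idx[np.argsort(combined[idx])[::-1][:per_cluster]])
    return sorted(int(i) for i in selected)
```

Selecting per cluster rather than globally is what prevents the subset from collapsing onto a single high-reward instruction style, which is the intuition behind the diversity-preservation step.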
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Instruction Tuning, Large Language Model, Data Selection, Model Distillation, Synthetic Data
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1426