Rethinking the Instruction Quality: LIFT is What You Need

ACL ARR 2024 April Submission 654 Authors

16 Apr 2024 (modified: 20 May 2024) · License: CC BY 4.0
Abstract: Instruction tuning, a specialized technique to enhance large language model (LLM) performance via instruction datasets, relies heavily on the quality of the employed data. Existing quality improvement methods alter instruction data through dataset expansion or curation. However, the expansion method introduces the risk of data deficiency and redundancy, potentially compromising the correctness and accuracy of the LLM's knowledge, while the curation approach confines the LLM's potential to the original dataset. Our aim is to surpass the original data quality without incurring these shortcomings. To achieve this, we propose LIFT (LLM Instruction Fusion Transfer), a novel and versatile paradigm designed to elevate instruction quality to new heights. LIFT strategically broadens the data distribution to encompass more high-quality subspaces and eliminates redundancy, concentrating on high-quality segments across the overall data subspaces. Experimental results demonstrate that, even with a limited quantity of high-quality instruction data selected by our paradigm, LLMs not only maintain robust performance across natural language understanding and code generation tasks but also surpass many state-of-the-art results, highlighting the significant improvement in instruction quality achieved by our paradigm.
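The abstract describes a two-stage "expand then curate" paradigm: first broaden the instruction distribution, then keep only high-quality, non-redundant instructions. The sketch below is a minimal illustration of that general shape, not the authors' actual method: the expansion step, the quality scorer, and the redundancy measure here are simple stand-ins (a paraphrase template, a length heuristic, and character n-gram Jaccard overlap), whereas a real pipeline would presumably delegate these roles to an LLM. All names (`Instruction`, `expand`, `score_quality`, `curate`) are hypothetical.

```python
# Minimal sketch of an expand-then-curate instruction pipeline (assumptions noted above).
from dataclasses import dataclass


@dataclass
class Instruction:
    text: str
    quality: float = 0.0


def expand(seeds: list[str]) -> list[Instruction]:
    """Broaden the data distribution by generating variants of each seed.

    Stand-in for LLM-driven expansion: here we only add one templated rephrasing.
    """
    pool = []
    for s in seeds:
        pool.append(Instruction(s))
        pool.append(Instruction(f"In your own words, {s[0].lower()}{s[1:]}"))
    return pool


def score_quality(inst: Instruction) -> float:
    """Stand-in quality score; a real pipeline would query an LLM judge."""
    return min(len(inst.text.split()) / 20.0, 1.0)  # mildly prefer more specific prompts


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Character n-gram Jaccard similarity, used to flag near-duplicates."""
    def grams(s: str) -> set:
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)


def curate(pool: list[Instruction], keep: int, max_sim: float = 0.6) -> list[Instruction]:
    """Keep the highest-quality instructions while dropping redundant ones."""
    selected: list[Instruction] = []
    for cand in sorted(pool, key=lambda i: i.quality, reverse=True):
        if len(selected) >= keep:
            break
        if all(ngram_jaccard(cand.text, s.text) < max_sim for s in selected):
            selected.append(cand)
    return selected


if __name__ == "__main__":
    seeds = [
        "Explain the difference between supervised and unsupervised learning.",
        "Write a Python function that reverses a linked list.",
    ]
    pool = expand(seeds)
    for inst in pool:
        inst.quality = score_quality(inst)
    for inst in curate(pool, keep=3):
        print(f"{inst.quality:.2f}  {inst.text}")
```

Running the script prints a small, deduplicated, quality-ranked subset of the expanded pool, which is the kind of output the curation stage is meant to produce at scale.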
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data augmentation
Contribution Types: Approaches for low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 654