Self-Evolution Fine-Tuning for Policy Optimization

Self-Evolution Fine-Tuning for Policy Optimization

ACL ARR 2024 June Submission936 Authors

13 Jun 2024 (modified: 02 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: The alignment of large language models (LLMs) is crucial not only for unlocking their potential in specific tasks but also for ensuring that responses meet human expectations and adhere to safety and ethical principles. To address the challenges of current alignment methodologies, we introduce self-evolution fine-tuning (SEFT) for LLM alignment, aiming to eliminate the need for annotated samples while retaining the stability and efficiency of SFT. SEFT first trains an adaptive reviser to elevate low-quality responses while maintaining high-quality ones. The reviser then gradually guides the policy's optimization by fine-tuning it with enhanced responses. The method excels in utilizing unlimited unannotated data to optimize policies via supervised fine-tuning. Our experiments on AlpacaEval and MT-Bench demonstrate the effectiveness of SEFT and its advantages over existing alignment techniques.

Paper Type: Long

Research Area: Language Modeling

Research Area Keywords: fine-tuning

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 936

Loading