When Evolution Strategy Meets Language Models Tuning

ACL ARR 2024 June Submission4292 Authors

16 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Supervised fine-tuning has been pivotal in training autoregressive language models, yet it introduces exposure bias. To mitigate this, post fine-tuning methods, both on-policy and off-policy, have emerged to enhance models further, though each has limitations regarding performance gains and susceptibility to overfitting. In this paper, we introduce a novel on-policy approach, called \textbf{Evolution Strategy Optimization} (ESO), which harnesses the principle of biological evolution, namely \emph{survival of the fittest}. Specifically, we view model tuning as an evolutionary process in which each output sentence generated by the model provides a perturbation signal in the model's parameter space. The fitness of each perturbation signal is then quantified as the difference between its reward score and the average reward given by a reward function, and this fitness steers the optimization process. Empirically, the proposed method achieves superior performance on various tasks and comparable performance on the human-alignment task. The code will be publicly available.
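To make the abstract's description concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a fitness-weighted update of the kind the abstract describes, written for PyTorch with a HuggingFace-style causal LM. The function name `eso_step`, the `reward_fn` callable, and the generation settings are illustrative assumptions; the sketch only mirrors the stated idea of fitness as a sample's reward minus the batch-average reward, used as a weight on the sample's log-likelihood.

```python
import torch

def eso_step(model, tokenizer, prompts, reward_fn, optimizer, num_samples=4):
    """One illustrative fitness-weighted update (REINFORCE-style surrogate).

    Assumes a HuggingFace-style causal LM and tokenizer with padding
    configured (e.g. pad_token set, left padding for generation).
    """
    model.train()
    enc = tokenizer(prompts, return_tensors="pt", padding=True)

    # Sample several candidate continuations per prompt: the "population"
    # whose members act as perturbation signals.
    with torch.no_grad():
        outputs = model.generate(
            **enc,
            do_sample=True,
            num_return_sequences=num_samples,
            max_new_tokens=64,
        )
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Score each sampled sentence; fitness is the deviation from the batch mean.
    rewards = torch.tensor([reward_fn(t) for t in texts], dtype=torch.float)
    fitness = rewards - rewards.mean()

    # Log-likelihood of each sampled sequence under the current model
    # (simplified: prompt tokens and padding are not masked out here).
    logits = model(outputs).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, outputs[:, 1:].unsqueeze(-1)).squeeze(-1)
    seq_lp = token_lp.sum(dim=-1)

    # Above-average reward pushes a sample's probability up; below-average
    # reward pushes it down.
    loss = -(fitness.to(seq_lp.device) * seq_lp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The paper's actual method may define perturbations and normalize fitness differently; this sketch is only a reading of the abstract's "reward minus average reward" signal.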
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: continual learning; fine-tuning
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4292