When Evolution Strategy Meets Language Models Tuning

Published: 2025, Last Modified: 28 Feb 2026COLING 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Supervised Fine-tuning has been pivotal in training autoregressive language models, yet it introduces exposure bias. To mitigate this, Post Fine-tuning, including on-policy and off-policy methods, has emerged as a solution to enhance models further. However, each has its limitations regarding performance enhancements and susceptibility to overfitting. In this paper, we introduce a novel on-policy approach called Evolution Strategy Optimization (ESO), which is designed by harnessing the principle of biological evolution, namely survival of the fittest. Particularly, we consider model tuning as an evolution process, and each output sentence generated by the model can provide a perturbation signal to the model parameter space. Then, the fitness of perturbation signals is quantified by the difference between its score and the averaged one offered by a reward function, which guides the optimization process. Empirically, the proposed method can achieve superior performance in various tasks and comparable performance in the human alignment task.
Loading