Reward-Aware Population Scaling of Evolutionary Strategies in LLM Fine-Tuning

Published: 29 May 2026, Last Modified: 29 May 2026, Venue: HiLD at ICML 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: evolutionary strategies, zeroth-order optimization, LLM fine-tuning, reward sparsity, population scaling
TL;DR: Small-population failure in binary-reward ES fine-tuning of LLMs can arise from reward sparsity and z-score advantage normalization, rather than an intrinsic population-size limit.
Abstract: Using Evolutionary Strategies (ES) for fine-tuning large language models is attractive because it is memory-efficient, parallel, and compatible with black-box or discrete rewards. Yet reported population-size requirements conflict sharply: fine-tuning with cross-entropy (CE) reward succeeds with $N=1$, while binary-reward training often needs $N \approx 30$. We show this gap is largely about reward design and normalization, not population size. Binary accuracy reward induces a zero-advantage probability $q$ that depends in closed form on base accuracy, batch size, and intra-pair correctness correlation; a zero-training probe on Qwen2.5-Instruct/GSM8K matches the formula with mean absolute error 0.020 across 12 configurations and finds the availability threshold $N_{\mathrm{avail}}$ to be small in this capable-model regime. In this regime, z-score advantage normalization, rather than population size, can cause $N=2$ to fail. Disabling normalization lets binary-reward ES with $N=2$ improve on both GSM8K and TREC, whereas the normalized variant collapses or degrades. The implication is not that $N=2$ is universally sufficient, but that small-population failure in capable-model binary ES can be an implementation artifact rather than an intrinsic population limit.
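As a rough illustration of how z-score normalization can interact with a population of two, the sketch below compares z-scored and mean-centered advantages for a pair of batch-accuracy rewards. It is not the paper's implementation: the function names, the use of the population standard deviation, and the `eps` stabilizer are all assumptions chosen for the example.

```python
import statistics

def zscore_advantages(rewards, eps=1e-8):
    """Z-score the rewards: (r - mean) / (std + eps).

    Assumption: population standard deviation plus a small eps,
    which is one common convention, not necessarily the paper's.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def centered_advantages(rewards):
    """Mean-centering only; the size of the reward gap is preserved."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]

# Hypothetical batch accuracies for N = 2 perturbed models.
tiny_gap = [0.50, 0.51]   # members differ on very few examples
large_gap = [0.20, 0.80]  # members differ on many examples
tie = [0.50, 0.50]        # identical rewards: no usable signal

for name, rewards in [("tiny", tiny_gap), ("large", large_gap), ("tie", tie)]:
    print(name, zscore_advantages(rewards), centered_advantages(rewards))
```

With two members, z-scoring maps any nonzero reward gap to roughly -1 and +1, so a one-example difference is weighted like a large one, while a tie yields zero advantage for both members. Mean-centering keeps the gap's magnitude instead.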
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 191