MEMETRON: Memetic Response Optimizer for Reward-Guided Post-Decoding Optimization of Large Language Models

TMLR Paper 7799 Authors

06 Mar 2026 (modified: 14 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Modern large language models (LLMs) are commonly optimized using scalar reward signals defined over completed responses, applied both during training and at inference time. However, most such reward-guided post-decoding methods remain one-shot: they independently sample a set of responses, score each once, and select the best. Sampling that stays shallow and narrow leaves higher-reward responses unrealized, while scaling up to shallow and wide sampling exacerbates reward hacking, making downstream selection methods such as best-of-$N$ and self-consistency unreliable. We propose MEMETRON, a memetic optimization framework that formulates reward-guided post-decoding optimization (RPDO) as discrete black-box optimization over completed responses. MEMETRON alternates between GENETRON, a population-based search stage, and ANNETRON, an annealing-based local refinement stage, both guided by a black-box scalar reward. Across mathematical reasoning and instruction-following tasks, MEMETRON reliably discovers higher-scoring responses. On mathematical reasoning, it increases pass@$k$ correctness coverage and improves the selection reliability of best-of-$N$ and self-consistency; on instruction following, it improves LLM-judge preference. On verifiable tasks, MEMETRON can incorporate ground-truth correctness via reward shaping. Comparing shaped and unshaped runs exposes extreme cases of misalignment between the reward model and correctness, and the resulting contrastive pairs serve as training signal for reward-model fine-tuning, rejection-sampling SFT warmups for RL-based training pipelines such as PPO and GRPO, and direct preference learning methods such as DPO.
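The abstract's description of the framework (a population-based search stage followed by annealing-based local refinement, both driven by a black-box scalar reward over completed responses) can be made concrete with a minimal sketch. The code below is illustrative only and is not the paper's implementation: the callables `sample_responses`, `llm_recombine`, `llm_perturb`, and `reward`, as well as the selection scheme, cooling schedule, and all hyperparameters, are assumed placeholders standing in for the LLM and the reward model.

```python
import math
import random

def memetic_search(prompt, sample_responses, llm_recombine, llm_perturb, reward,
                   pop_size=8, generations=4, anneal_steps=6, t0=1.0):
    """Hedged sketch of a memetic reward-guided post-decoding loop."""
    # Population stage (GENETRON-like): sample, select, and recombine full responses.
    population = sample_responses(prompt, pop_size)
    for _ in range(generations):
        ranked = sorted(population, key=reward, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]            # truncation selection
        children = [llm_recombine(prompt, random.sample(parents, 2))
                    for _ in range(pop_size - len(parents))]
        population = parents + children

    # Local refinement stage (ANNETRON-like): simulated annealing over LLM edits.
    best = max(population, key=reward)
    best_r = reward(best)
    current, current_r = best, best_r
    for step in range(anneal_steps):
        temp = t0 * (1.0 - step / anneal_steps) + 1e-6       # linear cooling schedule
        candidate = llm_perturb(prompt, current)              # local edit proposed by the LLM
        cand_r = reward(candidate)
        # Always accept improvements; accept worse edits with Boltzmann probability.
        if cand_r >= current_r or random.random() < math.exp((cand_r - current_r) / temp):
            current, current_r = candidate, cand_r
            if current_r > best_r:
                best, best_r = current, current_r
    return best
```

In such a loop the recombination and perturbation operators would themselves be LLM calls, and on verifiable tasks the same black-box `reward` could be shaped with ground-truth correctness, as the abstract describes.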
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Dennis_J._N._J._Soemers1
Submission Number: 7799