Taming Large Language Models for Free-Form Generation Via Reinforcement Learning With Verifiable Rewards

16 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Model, free-form RL generation
Abstract: Evaluating open-ended free-form generation is challenging because it is hard to define what clearly separates good outputs from bad ones. Existing methods often miss key aspects such as coherence, style, or relevance, or are biased by pretraining data, leaving open-ended long-form evaluation underexplored. To address this gap, we propose semantic evaluation, a scoring model that uses an LLM as the reward model for open-ended free-form generation in GRPO, guiding training by producing sufficiently distinct rewards for good and bad outputs. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that LLM scorers trained on multi-sentence and paragraph-length responses remain more reliable across varied long passages, and align better with the verifiable rewards GRPO requires, than standard free-form metrics. Human evaluations confirm that using these trained LLM scorers as the reward signal for policy training yields responses better aligned with human preferences than those trained with traditional metrics.
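The abstract describes feeding an LLM-based scorer into GRPO as the reward signal. Below is a minimal, hypothetical sketch of that idea: a placeholder judge produces scalar rewards for a group of sampled responses, and the rewards are normalized within the group to form GRPO-style advantages. All names here (semantic_score, grpo_advantages) are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: LLM-judge rewards feeding GRPO group-relative advantages.
from statistics import mean, pstdev

def semantic_score(prompt: str, response: str) -> float:
    """Placeholder for a trained LLM scorer rating coherence, style, relevance.
    In practice this would call the scorer model and return a scalar in [0, 1];
    here a crude word-overlap heuristic stands in for illustration."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return min(1.0, 0.1 * overlap + 0.01 * len(response.split()))

def grpo_advantages(prompt: str, responses: list[str]) -> list[float]:
    """Score a group of sampled responses and normalize rewards within the group,
    as GRPO does to obtain per-response advantages without a value network."""
    rewards = [semantic_score(prompt, r) for r in responses]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

if __name__ == "__main__":
    prompt = "Describe the benefits of spaced repetition for long-term memory."
    group = [
        "Spaced repetition strengthens long-term memory by revisiting material at growing intervals.",
        "It is a study technique.",
        "Memory memory memory.",
    ]
    for resp, adv in zip(group, grpo_advantages(prompt, group)):
        print(f"{adv:+.2f}  {resp[:60]}")
```

In an actual training loop, these advantages would weight the policy-gradient update for each sampled response; the key design choice the abstract highlights is making the scorer's rewards distinct enough across good and bad outputs for that update to be informative.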
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7973