Keywords: Generative Models, Reasoning, Post-training
Abstract: Reinforcement Learning-based post-training of
Large Language Models (LLMs) has been successfully applied to improve their reasoning capabilities. Existing pipelines primarily fine-tune LLMs
on a fixed pool of problems specified prior to
training using the GRPO loss. This is fundamentally limiting: a learning signal arises only when
the policy's rollouts on a problem mix successes and failures, so the useful portion of any fixed pool
shrinks quickly as the model improves. To address
this, we propose frontier learning, an open-ended
post-training approach in which procedural
generators are used online to continually produce
informative training problems. It treats the
generator’s task-specific parameters as a search
space and uses a novel regret signal to prioritize
and explore frontier difficulty levels in order to
focus training at the edge of the model’s evolving
reasoning capabilities. Across several reasoning
tasks, our approach consistently achieves higher
relative gains than fixed-pool baselines, demonstrating that effective post-training requires not
only selecting useful problems, but continually
generating them at the edge of capability.
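
The claim that a learning signal arises only when rollouts mix successes and failures can be illustrated with a minimal sketch. It assumes the commonly used GRPO formulation in which each rollout's advantage is its reward standardized within the group of rollouts sampled for the same problem; the abstract does not specify the exact variant, and the function name and `eps` value are illustrative only.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for a single problem.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    as in the usual GRPO formulation; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary correctness rewards for four rollouts on three hypothetical problems.
too_easy = group_relative_advantages([1, 1, 1, 1])  # every rollout succeeds
too_hard = group_relative_advantages([0, 0, 0, 0])  # every rollout fails
frontier = group_relative_advantages([1, 0, 1, 0])  # mixed outcomes

print(too_easy)
print(too_hard)
print(frontier)
```

Under this assumption, problems the model always solves or always fails yield all-zero advantages and therefore no gradient, while only mixed-outcome problems contribute to the update.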
Submission Number: 189