Actor-Curator: Co-adaptive curricula via policy-improvement bandits for post-training

Zhengyao Gu; Jonathan Light; Raul Astudillo; Ziyu Ye; Langzhou He; Wei Cheng; Santiago Paternain; Philip S. Yu; Yisong Yue

Actor-Curator: Co-adaptive curricula via policy-improvement bandits for post-training

Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue

Published: 05 Mar 2026, Last Modified: 25 Apr 2026ICLR 2026 Workshop LLM Reasoning OralEveryoneRevisionsBibTeXCC BY 4.0

Track: long paper (up to 10 pages)

Keywords: large language models, RL post-training, curriculum learning, adaptive data selection, policy optimization

TL;DR: A learned curator adaptively selects training problems to maximize policy improvement during RL post-training of LLMs

Abstract: Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.

Presenter: ~Jonathan_Light1

Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.

Submission Number: 108

Loading