Adaptive Curriculum Learning for RLHF with Influence-Based Cluster Bandits

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: RLHF, Curriculum Learning, GRPO
Abstract: Reinforcement learning (RL) plays a central role in post-training large language models (LLMs). Yet existing RLHF pipelines typically rely on fixed or uniform sampling strategies that fail to adapt to the model’s evolving learning state. This mismatch wastes computation on less informative samples while neglecting instances with higher training impact, ultimately limiting efficiency, generalization, and performance gains. We introduce an adaptive curriculum learning framework that integrates influence-based clustering with a multi-armed bandit (MAB) scheduler. Training data are partitioned into clusters defined by semantic and difficulty-related features, and each cluster is treated as an arm in the MAB formulation. A Cluster Score (CS), updated via sliding-window influence functions, quantifies the dynamic importance of each cluster as the model evolves. This adaptive scoring drives the scheduler to balance exploitation of high-impact clusters with exploration of underrepresented regions, ensuring efficient learning while maintaining diversity. Unlike prior approaches that overfit to narrow high-reward subsets, our cluster-level sampling prevents redundancy and broadens representational coverage. Experiments with Group Relative Policy Optimization (GRPO) across mathematical reasoning benchmarks show that our method consistently accelerates convergence and improves generalization. These results highlight the value of distribution-level adaptive curricula in advancing RLHF for LLM training.
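The abstract's scheduling loop can be sketched as a small bandit over clusters. This is a hypothetical illustration, not the paper's implementation: the class name `ClusterBanditScheduler`, the UCB-style selection rule, and the use of a windowed mean as the Cluster Score are all assumptions; the paper's actual influence-function estimator and exploration rule may differ.

```python
import math
from collections import deque

class ClusterBanditScheduler:
    """Sketch of a cluster-level MAB scheduler (hypothetical).

    Each data cluster is an arm. Its Cluster Score (CS) is the mean of
    recent influence estimates kept in a sliding window, and arms are
    chosen by a UCB-style rule that trades off exploitation of
    high-impact clusters against exploration of under-sampled ones.
    """

    def __init__(self, num_clusters: int, window: int = 50, explore: float = 1.0):
        self.windows = [deque(maxlen=window) for _ in range(num_clusters)]
        self.counts = [0] * num_clusters  # times each cluster was sampled
        self.total = 0                    # total batches scheduled
        self.explore = explore            # exploration strength

    def cluster_score(self, k: int) -> float:
        # Sliding-window mean of influence estimates for cluster k.
        w = self.windows[k]
        return sum(w) / len(w) if w else 0.0

    def select(self) -> int:
        # Sample every cluster at least once, then pick by CS + UCB bonus.
        for k, c in enumerate(self.counts):
            if c == 0:
                return k
        return max(
            range(len(self.counts)),
            key=lambda k: self.cluster_score(k)
            + self.explore * math.sqrt(math.log(self.total) / self.counts[k]),
        )

    def update(self, k: int, influence: float) -> None:
        # `influence` stands in for an influence-function estimate of the
        # sampled batch's effect on held-out loss (assumed interface).
        self.windows[k].append(influence)
        self.counts[k] += 1
        self.total += 1
```

In use, each training step would call `select()` to choose the cluster to draw the next batch from, then `update()` with the measured influence of that batch; the sliding window lets the Cluster Score track the model's evolving learning state rather than averaging over all of training.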
Primary Area: reinforcement learning
Submission Number: 11782