The Impact of Training Data Composition on Reinforcement Learning with Verifiable Rewards: Theoretical Analysis and Empirical Investigation
Keywords: reinforcement learning, verifiable rewards, training data
TL;DR: A theoretical analysis of how training data composition affects RLVR performance along three dimensions: reward signal quality, verification complexity, and generalization capability.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) represents a paradigm shift in training AI systems by incorporating explicit reward verification mechanisms. This paper provides a theoretical analysis of how training data composition affects RLVR performance along three dimensions: reward signal quality, verification complexity, and generalization capability. We establish convergence guarantees, sample complexity bounds, and optimal data composition ratios for RLVR systems. We introduce the Verifiable Reward Consistency Index (VRCI) and its robust extension for noisy constraints (VRCI-R), together with theoretical justification for their effectiveness. Our framework shows that optimal RLVR performance requires balancing verified and exploratory samples, and we derive bounds on the optimal verification coverage ratio. We further provide novel theoretical results on hierarchical verification constraints, noisy constraint handling, and the fundamental limits of verifiable learning. Finally, we present preliminary empirical validation of our theoretical claims and practical implementation guidelines for real-world RLVR systems.
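Since the abstract does not define VRCI or VRCI-R, the following is a minimal illustrative sketch to fix intuition, not the paper's method. It assumes VRCI measures agreement between thresholded reward signals and binary verifier outcomes, and that VRCI-R debiases that agreement under a symmetric verifier-noise model with a known flip probability; the function names, the 0.5 threshold, and the noise model are all hypothetical.

```python
import numpy as np

def vrci(rewards: np.ndarray, verified: np.ndarray) -> float:
    """Hypothetical VRCI: fraction of samples whose thresholded reward
    signal agrees with the binary verifier outcome."""
    assert rewards.shape == verified.shape
    predicted = (rewards > 0.5).astype(int)  # assumed reward threshold
    return float((predicted == verified.astype(int)).mean())

def vrci_r(rewards: np.ndarray, verified: np.ndarray, flip_prob: float = 0.1) -> float:
    """Hypothetical robust variant (VRCI-R): correct the raw agreement
    rate for verifier labels flipped with probability `flip_prob`.
    Under symmetric noise, observed = true*(1 - 2p) + p, so we invert."""
    raw = vrci(np.asarray(rewards), np.asarray(verified))
    return (raw - flip_prob) / (1.0 - 2.0 * flip_prob)

# Toy example: five samples, one disagreement between reward and verifier.
r = np.array([0.9, 0.2, 0.7, 0.1, 0.8])
v = np.array([1, 0, 0, 0, 1])
print(vrci(r, v), vrci_r(r, v))
```

On this toy data the raw agreement is 0.8, and the noise-corrected value under a 10% flip probability is 0.875.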
Submission Number: 228