Interpretable Intrinsic Cues for Efficient Reinforcement Learning with Large Language Models

10 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Intrinsic Cues, Data Efficiency, Reinforcement Learning with Verifiable Rewards
TL;DR: We propose PREPO, which leverages intrinsic cues (prompt perplexity and sequence-level entropy) for data-efficient reinforcement learning with verifiable rewards, cutting rollout costs by up to 3× while keeping training dynamics interpretable.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization relative to their heavy computational demands. This study investigates how leveraging interpretable, intrinsic data properties, available at almost no additional computational cost during training, can markedly improve the data efficiency of RLVR. We propose PREPO, an RLVR method with two complementary components. First, we use prompt perplexity as a proxy for the model's adaptability in learning, and adopt a schedule that guides the model from well-understood prompts to progressively more challenging ones. Second, we amplify the diversity among rollouts by differentiating their relative entropy and prioritizing sequences with greater exploratory behavior. Together, these mechanisms reduce rollout demand while preserving competitive performance. On Qwen and Llama models, PREPO matches baseline performance on mathematical reasoning benchmarks with up to 3× fewer rollouts. Beyond empirical gains, we provide theoretical and in-depth analyses that explain how our method improves the data efficiency of RLVR.
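The sketch below illustrates how the two intrinsic cues described in the abstract could be computed and used: prompt perplexity under the current policy for an easy-to-hard prompt schedule, and sequence-level entropy for prioritizing exploratory rollouts. It is a minimal illustration based only on the abstract; the function names, the easy-to-hard ordering rule, and the top-k rollout selection are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the two intrinsic cues (prompt perplexity, sequence-level
# entropy) and how they might drive scheduling and rollout prioritization.
# Names and selection rules are assumptions for illustration only.

import torch


def prompt_perplexity(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Per-prompt perplexity exp(-mean log p) over non-padding prompt tokens.

    token_logprobs: [batch, seq_len] log-probabilities of prompt tokens
                    under the current policy.
    mask:           [batch, seq_len], 1 for real tokens, 0 for padding.
    """
    mean_nll = -(token_logprobs * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return mean_nll.exp()


def sequence_entropy(step_entropies: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sequence-level entropy: mean per-token policy entropy over generated tokens."""
    return (step_entropies * mask).sum(-1) / mask.sum(-1).clamp(min=1)


def curriculum_order(perplexities: torch.Tensor) -> torch.Tensor:
    """Easy-to-hard schedule: train on low-perplexity (well-understood) prompts first."""
    return torch.argsort(perplexities)


def prioritize_rollouts(entropies: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` most exploratory rollouts (highest sequence entropy)."""
    return torch.topk(entropies, k=min(keep, entropies.numel())).indices


if __name__ == "__main__":
    torch.manual_seed(0)
    # Dummy prompt log-probs for 4 prompts of length 8.
    logprobs = -torch.rand(4, 8)
    ppl = prompt_perplexity(logprobs, torch.ones(4, 8))
    print("prompt order (easy -> hard):", curriculum_order(ppl).tolist())

    # Dummy per-token entropies for 6 rollouts of length 16.
    ent = torch.rand(6, 16)
    seq_ent = sequence_entropy(ent, torch.ones(6, 16))
    print("kept rollouts:", prioritize_rollouts(seq_ent, keep=3).tolist())
```

Both cues reuse quantities (token log-probabilities and per-step entropies) that the policy already produces during rollout generation, which is why they add essentially no extra compute.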
Primary Area: interpretability and explainable AI
Submission Number: 3612