Keywords: Intrinsic Cues, Data Efficiency, Reinforcement Learning with Verifiable Rewards
Abstract: Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization relative to their heavy computational demands. This study investigates how simply leveraging interpretable, intrinsic data properties, which come at almost no additional computational cost during training, can markedly improve data efficiency for RLVR. We propose PREPO, an RLVR method with two complementary components. First, we use prompt perplexity as a proxy for model adaptability in learning, and adopt a schedule that guides the model from well-understood prompts to progressively more challenging ones. Second, we amplify the diversity among rollouts by differentiating their relative entropy and prioritizing sequences with greater exploratory behavior. Together, these mechanisms reduce rollout demand while preserving competitive performance. On Qwen and Llama models, PREPO achieves competitive results on mathematical reasoning benchmarks with up to 3× fewer rollouts than baselines. Beyond the empirical gains, we provide theoretical and in-depth analyses that explain how our method improves the data efficiency of RLVR.
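The two components of the abstract can be illustrated with a minimal sketch. Note this is not the paper's implementation: the function names, the per-token log-probability inputs, and the `keep` parameter are all illustrative assumptions; the paper's actual scheduling and entropy-weighting rules may differ.

```python
import math

def prompt_perplexity(token_logprobs):
    # Perplexity = exp(-mean token log-probability); lower values
    # indicate prompts the model already handles well.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def schedule_prompts(prompts, logprobs_per_prompt):
    # Easy-to-hard curriculum: order prompts by ascending perplexity,
    # so training moves from well-understood to challenging prompts.
    scored = [(prompt_perplexity(lp), p)
              for p, lp in zip(prompts, logprobs_per_prompt)]
    return [p for _, p in sorted(scored, key=lambda t: t[0])]

def select_rollouts(rollouts, entropies, keep=2):
    # Prioritize rollouts with higher entropy (more exploratory
    # behavior), dropping low-diversity sequences to save compute.
    ranked = sorted(range(len(rollouts)),
                    key=lambda i: entropies[i], reverse=True)
    return [rollouts[i] for i in ranked[:keep]]
```

In this toy form, the scheduler consumes per-token log-probabilities from a scoring pass over the prompts, and the rollout filter assumes a precomputed sequence-level entropy for each rollout; both quantities are byproducts of generation, consistent with the abstract's claim of near-zero additional cost.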
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: data-efficient training, LLM efficiency, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings & efficiency
Languages Studied: English, Chinese
Submission Number: 4544