What Reward Structure Enables Efficient Sparse-Reward RL? A Proof-of-Concept with Policy-Aware Matrix Completion

12 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Reinforcement Learning
Abstract: Work on sparse-reward reinforcement learning typically focuses on exploration, but we ask: can structural assumptions about the reward function itself accelerate learning? We introduce \textbf{Policy-Aware Matrix Completion (PAMC)}, which exploits low-rank structure in reward matrices while correcting for policy-induced sampling bias. PAMC combines three key components: (i) a low-rank plus sparse reward model, (ii) inverse propensity weighting to handle Missing-Not-At-Random (MNAR) data, and (iii) confidence-gated abstention that falls back to intrinsic exploration when uncertain. We provide finite-sample theory showing that completion error scales as $O(\sigma\sqrt{r(|\mathcal{S}|+|\mathcal{A}|)/\text{ESS}})$, where ESS is the effective sample size under policy overlap $\kappa$. PAMC achieves strong empirical results: 4100$\pm$250 return vs. 200$\pm$50 for DrQ-v2 on Montezuma's Revenge, 78\% vs. 65\% success rate on MetaWorld-50, and a 15\% improvement over CQL on D4RL datasets. The method incurs only 8\% computational overhead while providing calibrated confidence intervals (95\% empirical coverage). When its structural assumptions are violated, PAMC degrades gracefully through increased abstention rather than catastrophic failure. Our approach demonstrates that reward-structure exploitation can complement traditional exploration methods in sparse-reward domains.
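For concreteness, the sketch below illustrates how the three components described in the abstract might fit together: a weighted low-rank plus sparse decomposition solved by proximal gradient with singular-value and soft thresholding, inverse propensity weights to correct MNAR sampling, and an effective-sample-size gate for abstention. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the solver choice, hyperparameters (`lam_nuc`, `lam_sp`, `min_ess`), and gating rule are all assumptions.

```python
# A minimal sketch of policy-aware matrix completion with inverse propensity
# weighting and confidence-gated abstention. Illustrative only: the proximal
# gradient solver, hyperparameters, and ESS-based gate are assumptions, not
# the paper's actual algorithm.

import numpy as np


def svt(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt


def soft(X, tau):
    """Entrywise soft thresholding: proximal operator of the L1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def pamc_complete(R_obs, mask, propensity, lam_nuc=1.0, lam_sp=0.1,
                  step=0.5, iters=500):
    """Estimate R ~= L (low rank) + S (sparse) from MNAR reward observations.

    R_obs      : |S| x |A| matrix of observed rewards (zeros where unobserved)
    mask       : binary matrix, 1 where a reward was observed
    propensity : estimated probability each entry is observed under the
                 behaviour policy (used as inverse propensity weights)
    """
    W = mask / np.clip(propensity, 1e-3, None)   # IPW weights correct MNAR bias
    L = np.zeros_like(R_obs)
    S = np.zeros_like(R_obs)
    for _ in range(iters):
        resid = W * (L + S - R_obs)              # weighted reconstruction error
        L = svt(L - step * resid, step * lam_nuc)
        S = soft(S - step * resid, step * lam_sp)
    return L, S


def confidence_gate(propensity, mask, min_ess=20.0):
    """Abstain (fall back to intrinsic exploration) on states whose
    effective sample size under the behaviour policy is too small."""
    w = mask / np.clip(propensity, 1e-3, None)
    ess = w.sum(axis=1) ** 2 / np.clip((w ** 2).sum(axis=1), 1e-8, None)
    return ess >= min_ess                        # True = trust the completion
```

In this reading, an agent would use the completed matrix `L + S` as a dense reward estimate only for states where `confidence_gate` returns True, and fall back to its intrinsic-exploration bonus elsewhere, which is one plausible way to realise the graceful degradation described in the abstract.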
Primary Area: reinforcement learning
Submission Number: 4574