Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning

TMLR Paper7549 Authors

17 Feb 2026 (modified: 04 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.
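For concreteness, in the CPT literature the CPT value of a random return $X$ with reference point $x_0$ is commonly defined as $C(X) = \int_0^{\infty} w^+\big(P(u^+(X) > z)\big)\,dz - \int_0^{\infty} w^-\big(P(u^-(X) > z)\big)\,dz$, where $u^+, u^-$ are the gain/loss utilities around $x_0$ and $w^+, w^-$ are probability weighting functions. Below is a minimal Python sketch of the kind of order-statistics Monte Carlo estimator the abstract refers to, assuming the standard Tversky-Kahneman power utilities and weighting function; the function names and parameter defaults are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tk_weight(p, gamma):
    """Tversky-Kahneman weighting w(p) = p^g / (p^g + (1-p)^g)^(1/g) (assumed form)."""
    p = np.asarray(p, dtype=float)
    return p**gamma / (p**gamma + (1.0 - p)**gamma) ** (1.0 / gamma)

def cpt_estimate(returns, ref=0.0, alpha=0.88, lam=2.25,
                 gamma_pos=0.61, gamma_neg=0.69):
    """Order-statistics Monte Carlo estimate of the CPT value of sampled returns.

    Parameter defaults are the classic Tversky-Kahneman fits and are
    illustrative only; the paper's estimator may differ in detail.
    """
    x = np.sort(np.asarray(returns, dtype=float))    # order statistics X_(1) <= ... <= X_(n)
    n = x.size
    u_pos = np.maximum(x - ref, 0.0) ** alpha        # utility of gains above the reference
    u_neg = lam * np.maximum(ref - x, 0.0) ** alpha  # utility of losses (loss aversion lam)
    i = np.arange(1, n + 1)
    # Distorted upper-tail increments for gains:
    # w+((n+1-i)/n) - w+((n-i)/n) is the weight placed on X_(i).
    dw_pos = tk_weight((n + 1 - i) / n, gamma_pos) - tk_weight((n - i) / n, gamma_pos)
    # Distorted lower-tail increments for losses.
    dw_neg = tk_weight(i / n, gamma_neg) - tk_weight((i - 1) / n, gamma_neg)
    return float(u_pos @ dw_pos - u_neg @ dw_neg)

# Example: CPT value of simulated episode returns under a fixed policy.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(loc=1.0, scale=2.0, size=10_000)
    print(cpt_estimate(samples))  # loss aversion typically pulls this below the sample mean
```

The sorted samples estimate the tail probabilities $P(X > X_{(i)})$ by $(n-i)/n$, so each order statistic contributes its utility weighted by an increment of the distorted tail distribution; the increments telescope to $w(1) - w(0) = 1$, so the estimate reduces to a plain sample mean of utilities when no distortion is applied ($w(p) = p$).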
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: EiC revision: changed the submission type to Long on the AE's recommendation, since some aspects of the Appendix must be reviewed in order to assess the submission thoroughly.
Assigned Action Editor: ~Amir-massoud_Farahmand1
Submission Number: 7549