Stochastic Differential Policy Optimization: A Rough Path Approach to Reinforcement Learning

Published: 28 Jun 2025, Last Modified: 28 Jun 2025, TASC 2025, CC BY 4.0
Keywords: Stochastic Control, Reinforcement Learning, Rough Path Theory, Pontryagin Maximum Principle, Operator Learning
TL;DR: We extend differential RL to the stochastic setting with theoretical results on convergence, sample complexity, and regret.
Abstract: We extend Differential Policy Optimization (DPO) to stochastic settings by deriving a discrete-time algorithm from the stochastic Pontryagin Maximum Principle using rough path theory. The framework preserves DPO's operator-based structure while incorporating stochasticity via Brownian and second-level rough path increments. We prove pointwise convergence, establish sample complexity bounds, and derive a regret bound of $O(K^{5/6})$. This provides a theoretically grounded approach to policy learning in continuous-time stochastic control settings.
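The abstract refers to incorporating stochasticity via Brownian and second-level rough path increments. As a minimal illustrative sketch (not the paper's algorithm, whose details are not given here), the snippet below samples the two signature levels for a scalar Brownian motion on a uniform grid: the level-1 increments $\Delta W$ and the level-2 Itô iterated integrals $\int W\,dW = \tfrac{1}{2}(\Delta W^2 - \Delta t)$.

```python
import numpy as np

def rough_path_increments(T=1.0, n_steps=1000, seed=0):
    """Sample level-1 and level-2 rough path increments of scalar
    Brownian motion over a uniform grid.

    Illustrative only: function name and interface are assumptions,
    not taken from the paper. For scalar W, the level-2 Ito increment
    over [t, t + dt] is (dW**2 - dt) / 2.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    # Level 1: independent Gaussian increments with variance dt.
    dW = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    # Level 2: Ito iterated integral over each subinterval.
    level2 = 0.5 * (dW**2 - dt)
    return dW, level2
```

In a discretized scheme derived from the stochastic Pontryagin Maximum Principle, such increments would drive the state and adjoint updates; the mean of the level-2 increments is zero under the Itô convention.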
Submission Number: 10