Isolating Stochastic Sources of Policy Divergence in Proximal Policy Optimization

TMLR Paper9190 Authors

24 May 2026 (modified: 01 Jun 2026) | Under review for TMLR | CC BY 4.0

Abstract: This work analyzes how Reinforcement Learning (RL) policies diverge during training, aiming to separate divergence arising from environmental randomness from divergence arising from the policy’s initialization. We trained 300 policies using Proximal Policy Optimization (PPO) across three contexts in the MinAtar testbed: (1) with fixed parameter initialization, (2) with fixed sampled scenarios, and (3) with neither fixed. The resulting policies were examined for performance, feature-attribution overlap, action disaccord, and overlap in critical state estimates. For every examination, the distributions of policies are similar across all training contexts, except that the overlap in feature attributions increases when the initial parameters are fixed. These results show that, even when parameter initialization or the scenarios drawn from the environment are controlled, the PPO policies diverge similarly across training contexts. The results therefore suggest that PPO exhibits severe path dependence: the unpredictability of the final policy is deeply ingrained in PPO’s stochastic exploration-update loop. Furthermore, our investigation demonstrates that similarly trained PPO policies exhibit substantial differences in how they solve the RL tasks.
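The three training contexts can be read as a seeding scheme in which one source of randomness is held constant while the others vary from run to run. The sketch below is only an illustration of that reading, not the authors' implementation: the names `RunSeeds`, `seeds_for_run`, and `SHARED_SEED`, the context labels, and the separate `policy_seed` for action sampling are all hypothetical, and the abstract does not say how the 300 policies are divided among contexts.

```python
from dataclasses import dataclass

# Hypothetical labels for the three contexts named in the abstract.
CONTEXTS = ("fixed_init", "fixed_scenarios", "neither_fixed")

# Hypothetical seed value reused wherever a source of randomness is held fixed.
SHARED_SEED = 0


@dataclass(frozen=True)
class RunSeeds:
    init_seed: int      # seeds the initial network parameters
    scenario_seed: int  # seeds the scenarios drawn from the environment
    policy_seed: int    # seeds action sampling (assumed never fixed)


def seeds_for_run(context: str, run_index: int) -> RunSeeds:
    """Return the seeds for one training run in the given context."""
    if context not in CONTEXTS:
        raise ValueError(f"unknown context: {context!r}")
    if run_index < 0:
        raise ValueError("run_index must be non-negative")

    # Three distinct, non-zero seeds per run, so free sources differ across runs
    # and never collide with SHARED_SEED.
    base = 3 * run_index + 1
    init_seed, scenario_seed, policy_seed = base, base + 1, base + 2

    if context == "fixed_init":
        init_seed = SHARED_SEED
    elif context == "fixed_scenarios":
        scenario_seed = SHARED_SEED
    # "neither_fixed": every source varies with run_index.

    return RunSeeds(init_seed, scenario_seed, policy_seed)
```

Under this scheme, two runs in the `fixed_init` context share `init_seed` but differ in `scenario_seed` and `policy_seed`, which is what lets divergence between them be attributed to sources other than initialization.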

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Goran_Radanovic1

Submission Number: 9190