Quantifying First‐Order Markov Breakdowns in Noisy Reinforcement Learning: A Causal Discovery Approach
Keywords: Markov Property, PCMCI (Causal Discovery), PPO, Noisy Reinforcement Learning.
TL;DR: Introduces a PCMCI-based Markov Violation Score that quantifies how noise or partial observability breaks the Markov property in reinforcement learning tasks.
Abstract: Reinforcement learning (RL) methods often assume that each new observation fully captures the environment’s state, ensuring Markovian (one‐step) transitions. Real‐world deployments, however, frequently violate this assumption due to partial observability or noise in sensors and actuators. This paper introduces a systematic methodology for diagnosing such violations, combining a partial-correlation-based causal discovery procedure (PCMCI) with a newly proposed Markov Violation Score (MVS). The MVS quantifies multi‐step dependencies that emerge when noise or incomplete state information disrupts the Markov property.
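As a rough illustration of this diagnostic pipeline (not the paper's exact implementation), the sketch below runs PCMCI with a partial-correlation test on logged rollout data and aggregates significant dependencies at lags of two or more into a single score. The function name and the aggregation rule are my own illustrative choices, and tigramite import paths vary slightly across library versions.

# Minimal sketch: PCMCI + ParCorr on rollout data, then an illustrative
# violation score from significant multi-lag links. Requires `tigramite`.
import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr  # tigramite >= 5: tigramite.independence_tests.parcorr

def markov_violation_score(rollout, var_names, tau_max=3, alpha=0.05):
    # rollout: (T, N) array of logged per-step variables (state dims, action, reward).
    dataframe = pp.DataFrame(rollout, var_names=var_names)
    pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr())
    results = pcmci.run_pcmci(tau_max=tau_max, pc_alpha=0.05)
    val = np.abs(results["val_matrix"])    # partial correlations, shape (N, N, tau_max + 1)
    sig = results["p_matrix"] < alpha      # significance mask at the chosen alpha
    # Lag-0 and lag-1 links are compatible with a first-order Markov model;
    # significant links at lag >= 2 are counted toward the violation score.
    # NOTE: this sum is a stand-in for the paper's actual MVS definition.
    return float((sig[:, :, 2:] * val[:, :, 2:]).sum())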
Classic control tasks (CartPole, Pendulum, Acrobot) are used to assess how targeted noise and dimension omissions affect both RL performance and the measured Markov consistency. Contrary to expectations, heavy observation noise often fails to induce strong multi‐lag dependencies in certain tasks (e.g., Acrobot). Dimension‐dropping experiments further reveal that omitting certain state variables (e.g., angular velocities in CartPole and Pendulum) substantially degrades returns and elevates MVS, while other dimensions can be removed with negligible effect.
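The two perturbations above can be reproduced with simple observation wrappers. The sketch below assumes Gymnasium; the wrapper and parameter names are my own and are not taken from the paper's released code.

# Illustrative wrappers: Gaussian observation noise and dropping a state dimension.
import numpy as np
import gymnasium as gym

class GaussianObsNoise(gym.ObservationWrapper):
    """Add zero-mean Gaussian noise to every observation dimension."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return (obs + np.random.normal(0.0, self.sigma, size=obs.shape)).astype(obs.dtype)

class DropObsDimension(gym.ObservationWrapper):
    """Remove one state variable (e.g., an angular velocity) from the observation."""
    def __init__(self, env, drop_index):
        super().__init__(env)
        self.drop_index = drop_index
        low = np.delete(env.observation_space.low, drop_index)
        high = np.delete(env.observation_space.high, drop_index)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=env.observation_space.dtype)

    def observation(self, obs):
        return np.delete(obs, self.drop_index)

# Example: noisy CartPole with the pole angular velocity (index 3) removed.
env = DropObsDimension(GaussianObsNoise(gym.make("CartPole-v1"), sigma=0.05), drop_index=3)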
These findings highlight the importance of identifying and safeguarding the most causally critical dimensions to maintain effective one‐step learning. By linking partial-correlation tests to RL performance metrics, the proposed approach pinpoints when and where the Markov property breaks down. This framework offers a principled tool for designing robust policies, guiding representation learning, and handling partial observability in real‐world RL tasks. All code and experimental logs are publicly available for reproducibility (URL omitted for double‐blind review).
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 24338