Abstract: Learning the underlying Markovian dynamics of an environment from partial observations is a key first step towards model-based reinforcement learning. Considering the environment as a Partially Observable Markov Decision Process (POMDP), state representations are typically inferred from the history of past observations and actions. Instead, we design a Dynamical Variational Auto-Encoder (DVAE) to learn causal Markovian dynamics from offline trajectories in a factored-POMDP setting. In doing so, we show that incorporating future information is essential to accurately capture the causal dynamics and the underlying Markovian states. Our method employs an extended hindsight framework that integrates past, current, and multi-step future information to infer hidden factors in a principled way, while simultaneously learning the transition dynamics as a structural causal model. Our framework is derived by maximizing the log-likelihood of complete trajectories, factorized over time and state. Empirical results in a one-hidden-factor factored-POMDP setting reveal that this approach uncovers the hidden factor up to a simple transformation, as well as the transition model and causal graph, more effectively than history-based, typical one-step-hindsight-based, and full-trajectory bidirectional-RNN-based models.
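As a rough sketch of the kind of objective described above (the notation here is assumed for illustration, not taken from the paper): with hidden factors $z_{1:T}$, observations $o_{1:T}$, actions $a_{1:T}$, and a hindsight window of $k$ future steps, a trajectory log-likelihood factorized in time and state, paired with an extended-hindsight posterior, could yield a bound of the form

$$\log p_\theta(o_{1:T} \mid a_{1:T}) \;\ge\; \mathbb{E}_{q_\phi}\!\left[\sum_{t=1}^{T} \log p_\theta(o_t \mid z_t) + \log p_\theta(z_t \mid z_{t-1}, a_{t-1}) - \log q_\phi(z_t \mid z_{t-1},\, o_{1:t+k},\, a_{1:t+k-1})\right],$$

i.e. a sequential ELBO in which the approximate posterior over each hidden factor conditions on the past as well as $k$ steps of future information.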
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Minor correction to Eqn 2.
Assigned Action Editor: ~Jaakko_Peltonen1
Submission Number: 4096