Grounding State Representations in Sensory Experience for Reasoning and Planning by Mobile Robots

AAAI/IAAI 2000
Abstract: We address the problem of learning probabilistic models of the interaction between a mobile robot and its environment and using these models for task planning. This requires modifying state-of-the-art reinforcement learning algorithms to deal with hidden state and high-dimensional observation spaces of continuous variables. Our approach is to identify hidden states by means of the trajectories leading into and out of them, and to perform clustering in this trajectory-embedding space in order to compile a partially observable Markov decision process (POMDP) model, which can then be used for approximate decision-theoretic planning. The ultimate objective of our work is to develop algorithms that learn POMDP models whose discrete hidden states are defined (grounded) directly in continuous sensory variables such as sonar and infrared readings.

Mobile robots often have to reason and plan their course of action in unknown, non-deterministic, and partially observable environments. Acquiring a model of the environment that reflects the effects of the robot's actions is essential in such cases. The framework of partially observable Markov decision processes (POMDPs) is especially suitable for representing stochastic models of dynamic systems. Reasoning, planning, and learning in fully observable state spaces are relatively well understood, and finding optimal policies for problems with huge numbers of states is now feasible. In contrast, planning and learning in partially observable state spaces have proven very difficult. Finding optimal policies for POMDPs has been shown to be PSPACE-hard, and currently solvable problems rarely have more than a hundred states. Learning POMDPs is harder still: even for problems with discrete state spaces and discrete observations, the largest problems solved have only dozens of states. The usual approach to learning POMDPs has been to unfold the model in time and employ a general method for learning probabilistic networks, such as Baum-Welch (BW), which tunes the parameters of the model so as to increase the likelihood of the training data given those parameters. Such algorithms often get trapped in shallow local maxima of the optimization surface and can rarely recover the true model.
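The abstract describes the approach only at a high level. The sketch below (Python) is a minimal illustration, not the paper's implementation, of the general idea: embed each time step by the sensor readings leading into and out of it, cluster those embeddings to obtain discrete hidden states, and estimate transition probabilities by counting. The window size, the choice of k-means, and all function and variable names are assumptions introduced here for illustration only.

# Minimal sketch of compiling a discrete-state transition model from
# continuous sensor trajectories. Illustrative only; the paper's actual
# embedding, clustering, and smoothing choices may differ.

import numpy as np
from sklearn.cluster import KMeans

def embed_trajectories(observations, window=3):
    """Stack the readings leading into and out of each step into one vector."""
    T, d = observations.shape
    embeddings = []
    for t in range(window, T - window):
        segment = observations[t - window : t + window + 1]  # past + current + future
        embeddings.append(segment.reshape(-1))
    return np.asarray(embeddings)

def compile_transition_model(observations, actions, n_states=8, window=3):
    """Cluster trajectory embeddings into states and count state transitions."""
    emb = embed_trajectories(observations, window)
    states = KMeans(n_clusters=n_states, n_init=10).fit_predict(emb)

    # Maximum-likelihood counts for P(s' | s, a), with add-one smoothing.
    n_actions = int(actions.max()) + 1
    counts = np.ones((n_states, n_actions, n_states))
    for t in range(len(states) - 1):
        a = actions[window + t]          # action taken at the embedded time step
        counts[states[t], a, states[t + 1]] += 1
    transitions = counts / counts.sum(axis=2, keepdims=True)
    return states, transitions

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.random((200, 16))          # 200 steps of 16 sonar-like range readings
    acts = rng.integers(0, 3, size=200)  # 3 discrete actions
    states, P = compile_transition_model(obs, acts)
    print(states[:10], P.shape)

With the discrete states in hand, observation probabilities could be estimated analogously by histogramming the sensor readings assigned to each cluster, yielding the components of a POMDP model usable for approximate decision-theoretic planning.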