Learning parameterized policies for Markov decision processes through demonstrations

Manjesh Kumar Hanawal, Hao Liu, Henghui Zhu, Ioannis Ch. Paschalidis

2016 (modified: 19 May 2025) · CDC 2016 · Readers: Everyone

Abstract: We consider the problem of learning a policy used by an agent in a Markov decision process using state-action samples. We focus on a class of parameterized policies and use ℓ₁-regularized logistic regression to train a policy that best fits the observed state-action pairs (demonstrations). We bound the difference in average reward of the trained and the original policy (regret) in terms of the generalization error and sensitivity parameters of the Markov chain. Specifically, we use techniques from sample complexity theory to relate regret to the generalization error and techniques from sensitivity analysis of the stationary distribution of Markov chains to relate regret to the ergodic coefficient of the Markov chain. We demonstrate the effectiveness of our method on a synthetic example.
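
As a rough illustration of the fitting step described in the abstract, the sketch below trains an ℓ₁-regularized logistic regression on synthetic state-action pairs. It is not the authors' implementation: the two-action setting, the Gaussian state features, the proximal-gradient (ISTA) solver, and all names and constants (`fit_l1_logistic`, `lam`, `step`, `theta_true`) are assumptions made only for this example, and it does not compute the regret bound.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(v, t):
    # Proximal operator of t * ||v||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_l1_logistic(X, y, lam=0.01, step=0.5, iters=2000):
    """Minimize mean logistic loss + lam * ||theta||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / n  # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Hypothetical demonstrations: state features X and binary actions y drawn
# from an assumed sparse "expert" policy P(a=1 | s) = sigmoid(theta_true . phi(s)).
rng = np.random.default_rng(0)
n, d = 2000, 10
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

theta_hat = fit_l1_logistic(X, y)

# Fraction of sampled states where the fitted and expert policies
# prefer the same action (a crude proxy for generalization error).
agreement = np.mean((X @ theta_hat > 0) == (X @ theta_true > 0))
print("fitted parameters:", np.round(theta_hat, 2))
print("greedy-action agreement:", agreement)
```

The ℓ₁ penalty pushes coefficients of irrelevant features toward zero, which is the usual reason for choosing it when the policy is assumed to depend on only a few features.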
