Abstract: We consider the problem of online parameter estimation by maximum likelihood for Partially Observed Markov Decision Processes (POMDPs). Classical approaches based on the EM algorithm are designed for the episodic setting and are not suited to this problem. We develop a methodology based on extremum seeking that yields a multiple time scale iterative scheme for the online problem, and we study it both theoretically and through numerical experiments.
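The abstract does not spell out the scheme itself. As a rough illustration of the general idea only, the sketch below is a textbook sinusoidal-dither extremum seeking loop with two step sizes: a fast running mean that acts as a high-pass filter on the sampled log-likelihood, and a much slower parameter update driven by the demodulated signal. It is not the authors' algorithm. It assumes a hypothetical toy model with i.i.d. Gaussian observations in place of a POMDP, and the names `extremum_seeking`, `gaussian_log_lik`, `THETA_TRUE` and all step sizes are illustrative assumptions.

```python
import math
import random

# Hypothetical true parameter of a toy model (NOT a POMDP): y ~ N(THETA_TRUE, 1).
THETA_TRUE = 2.0


def gaussian_log_lik(theta, rng):
    """Log-likelihood of one fresh observation y ~ N(THETA_TRUE, 1) under N(theta, 1)."""
    y = rng.gauss(THETA_TRUE, 1.0)
    return -0.5 * (y - theta) ** 2 - 0.5 * math.log(2 * math.pi)


def extremum_seeking(log_lik, theta0, n_steps, a=0.5, omega=0.8,
                     beta=0.1, eps=0.005, seed=0):
    """Generic sinusoidal-dither extremum seeking on a noisy objective.

    beta is the fast step size of the running mean (high-pass filter);
    eps << beta is the slow step size of the parameter update.
    """
    rng = random.Random(seed)
    theta, eta = theta0, 0.0
    for n in range(n_steps):
        s = math.sin(omega * n)              # dither (probing) signal
        sample = log_lik(theta + a * s, rng) # evaluate the perturbed parameter
        hp = sample - eta                    # high-pass: remove the slowly varying mean
        eta += beta * hp                     # fast time scale: track the mean objective
        theta += eps * hp * s                # slow time scale: demodulated gradient ascent
    return theta


theta_hat = extremum_seeking(gaussian_log_lik, theta0=0.0, n_steps=40000)
print(theta_hat)
```

Averaged over the dither, the update behaves like gradient ascent on the expected log-likelihood with effective step size roughly `eps * a / 2`, which is why `eps` must be small relative to `beta` for the two time scales to separate.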