Abstract: First-order policy optimization has been widely used in reinforcement learning. It is guaranteed to find the optimal policy for the state-feedback linear quadratic regulator (LQR). However, the performance of policy optimization remains unclear for linear quadratic Gaussian (LQG) control, where the LQG cost has spurious suboptimal stationary points. In this paper, we introduce a novel perturbed policy gradient descent (PGD) method that escapes a large class of bad stationary points, including high-order saddles. In particular, exploiting the specific structure of LQG, we introduce a novel reparameterization procedure that converts the iterate from a high-order saddle into a strict saddle, from which the standard random perturbations used in PGD can escape efficiently. We further characterize the high-order saddles that our algorithm can escape.
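For reference, the sketch below illustrates generic perturbed gradient descent (in the style of Jin et al., 2017) on a toy nonconvex cost with a strict saddle at the origin: take gradient steps, and when the gradient is small (a candidate stationary point), inject a small random perturbation to escape along a negative-curvature direction. This is a minimal illustration only, not the paper's PGD method (which additionally uses the LQG-specific reparameterization to handle high-order saddles); the toy cost, step sizes, and function names are all illustrative assumptions.

import numpy as np

def perturbed_gradient_descent(grad, x0, step=1e-2, eps=1e-4,
                               radius=1e-2, patience=50, iters=5000,
                               rng=None):
    """Generic perturbed gradient descent: gradient steps, plus a
    random perturbation from a small ball whenever the gradient is
    nearly zero (a candidate saddle). Illustrative sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    since_perturb = patience  # allow an immediate first perturbation
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < eps and since_perturb >= patience:
            # Near a first-order stationary point: perturb to escape
            # a strict saddle (a negative-curvature direction exists).
            x = x + rng.uniform(-radius, radius, size=x.shape)
            since_perturb = 0
        else:
            x = x - step * g
            since_perturb += 1
    return x

# Toy nonconvex cost f(x, y) = x^2 - y^2 + y^4, a stand-in for a
# nonconvex policy-optimization landscape: the origin is a strict
# saddle, and (0, +/- 1/sqrt(2)) are local minima.
cost = lambda x: x[0]**2 - x[1]**2 + x[1]**4
grad = lambda x: np.array([2 * x[0], -2 * x[1] + 4 * x[1]**3])

x_star = perturbed_gradient_descent(grad, np.zeros(2))
print(x_star, cost(x_star))  # escapes the origin toward a local minimum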