Abstract: We present a learning algorithm for players to converge to their stationary policies in a general sum stochastic sequential Stackelberg game. The algorithm is a two time scale implicit policy gradient algorithm that provably converges to stationary points of the optimization problems of the two players. Our analysis allows us to move beyond the assumptions of zero-sum or static Stackelberg games made in the existing literature for learning algorithms to converge.
Loading