Abstract: Match plan generation in the inverted index at Microsoft Bing used to be based on hand-crafted rules. We formulate the generation process as a Parameterized Action MDP with shared parameters and propose a reinforcement learning algorithm for this formulation. We combine deterministic policy learning over discrete and continuous action spaces with several recent advances in deep reinforcement learning. To explore the parameterized action space, the agent outputs softmax values for the discrete actions and applies Parameter Space Noise to the policy network, unifying the exploration direction in both spaces. We apply prioritized recurrent replay to match plan sequences and pad short match plans. We also use invertible value function rescaling and n-step returns to stabilize training. The agent is evaluated on our environment and on several benchmarks: it outperforms the well-designed production match plan and beats the baselines on the benchmarks.
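As an illustration of the stabilization tricks mentioned in the abstract, here is a minimal sketch of an invertible value-function rescaling combined with an n-step return target, assuming the h / h^-1 functions from Pohlen et al. (2018) as popularized by R2D2; the function names (`value_rescale`, `n_step_target`, etc.) and the example numbers are illustrative, not the authors' implementation.

```python
import numpy as np

EPS = 1e-2  # rescaling constant commonly used with this transform


def value_rescale(x: np.ndarray) -> np.ndarray:
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x


def inverse_value_rescale(x: np.ndarray) -> np.ndarray:
    """h^{-1}(x), the closed-form inverse of value_rescale."""
    return np.sign(x) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(x) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2
        - 1.0
    )


def n_step_target(rewards, gamma, bootstrap_q):
    """Rescaled n-step target: h( sum_k gamma^k r_k + gamma^n * h^{-1}(Q') )."""
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return value_rescale(discounted + gamma ** n * inverse_value_rescale(bootstrap_q))


# Example: 3-step return with a bootstrapped target-network value of 5.0.
print(n_step_target([1.0, 0.0, 2.0], gamma=0.99, bootstrap_q=5.0))
```

Applying the inverse before bootstrapping keeps the Bellman target consistent with the rescaled value estimates the network is trained to output.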
Code: http://github.com/zlf0625/iclr2020-code