- Keywords: alphazero, reinforcement learning, two-player games, heuristic search, deep neural networks
- TL;DR: An empirical study of the three-head architecture for AlphaZero learning
- Abstract: The search-based reinforcement learning algorithm AlphaZero has been used as a general method for mastering the two-player games Go, chess, and shogi. One crucial ingredient of AlphaZero (and its predecessor AlphaGo Zero) is the two-head network architecture, which outputs two estimates, policy and value, for a single input game state. The merit of such an architecture is that letting policy and value learning share the same representation substantially improves the generalization of the neural net. A three-head network architecture has recently been proposed that learns a third, action-value head on the same fixed dataset used for the two-head net. Using the action-value head in Monte Carlo tree search (MCTS) also improved search efficiency. However, the effectiveness of the three-head network has not been investigated in an AlphaZero-style learning paradigm. In this paper, using the game of Hex as a test domain, we conduct an empirical study of the three-head network architecture in AlphaZero learning. We show that the architecture is also advantageous in zero-style iterative learning. Specifically, we find that the three-head network can yield the following benefits: (1) learning can become faster as search takes advantage of the additional action-value head; (2) better prediction results than the two-head architecture can be achieved when action-value learning is used as an auxiliary task.