Keywords: metareasoning, reinforcement learning, MCTS, tree search, search, decision-time planning
TL;DR: Tree-search algorithms perform better when they use more compute, but using too much compute is costly. We model that cost and modify a tree-search algorithm so that it can use a variable amount of compute.
Abstract: Decision-time planning (DTP) agents use significant amounts of time and compute to search before taking an action. Such agents have been instrumental in achieving strong performance in domains like chess, Go, and poker. Some DTP agents, like AlphaZero, are trained via reinforcement learning. However, the learning objectives these agents optimize typically do not include the cost of time and compute. Instead, algorithm designers control the trade-off between performance and compute cost by tuning coarse-grained hyperparameters or writing hardcoded heuristics.
In this work, we introduce Dynamic Thinker, a modification of Thinker, an existing state-of-the-art tree-search agent. Dynamic Thinker can optimize its DTP behavior under objective functions that account for the cost of computation. We design such objective functions for several toy environments and show that Dynamic Thinker outperforms Thinker and AlphaZero.
Qualitatively, we find that Dynamic Thinker performs well by learning to use compute resources efficiently. We also highlight interesting emergent behavior, such as Dynamic Thinker using more search at the start of the episode in one environment and more search near the end of the episode in another.
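As a minimal sketch of the kind of compute-cost-aware objective described above, the snippet below charges the agent a fixed cost for each tree expansion it chooses to perform before acting. The function name, the `cost_per_expansion` parameter, and the example values are hypothetical illustrations, not the paper's actual formulation.

```python
def penalized_return(rewards, expansions_per_step, cost_per_expansion=0.01, gamma=0.99):
    """Discounted return where each environment step is charged for the
    tree expansions spent before acting (illustrative only)."""
    g = 0.0
    for r, n in zip(reversed(rewards), reversed(expansions_per_step)):
        g = (r - cost_per_expansion * n) + gamma * g
    return g

# Same task rewards, different amounts of search: more search lowers the
# penalized return unless it leads to better task rewards.
print(penalized_return([0.0, 0.0, 1.0], expansions_per_step=[20, 5, 1]))
print(penalized_return([0.0, 0.0, 1.0], expansions_per_step=[2, 2, 2]))
```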
Type Of Paper: Full paper (max 8 pages)
Anonymous Submission: Anonymized submission.
Submission Number: 44