Abstract: A popular approach for online decision making in large MDPs is time-bounded tree search. The effectiveness of tree search, however, is largely influenced by the action branching factor, which limits the search depth given a time bound. An obvious way to reduce action branching is to consider only a subset of potentially good actions at each state, as specified by a provided partial policy. In this work, we consider offline learning of such partial policies with the goal of speeding up search without significantly reducing decision-making quality. Our first contribution is to study learning algorithms based on reducing our learning problem to i.i.d. supervised learning. We give a reduction-style analysis of three such algorithms, each making different assumptions, which relates the supervised learning objectives to the sub-optimality of search using the learned partial policies. Our second contribution is to describe concrete implementations of the algorithms within the popular framework of Monte-Carlo tree search. Finally, the third contribution is to evaluate the learning algorithms in two challenging MDPs with large action branching factors, showing that the learned partial policies can significantly improve the anytime performance of Monte-Carlo tree search.
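The core idea behind the abstract can be pictured as follows: at every node of the search tree, branching is restricted to the action subset returned by a (learned) partial policy, so a fixed simulation budget is spent deepening the search rather than widening it. Below is a minimal UCT-style sketch of that pruning, not the paper's actual implementation; the `mdp` interface (`sample`, `is_terminal`) and the `partial_policy` callable are assumed placeholders.

```python
import math
import random


class Node:
    """One search-tree node; statistics are kept only for the pruned action set."""
    def __init__(self, state, actions):
        self.state = state
        self.actions = actions                      # subset returned by the partial policy
        self.counts = {a: 0 for a in actions}
        self.values = {a: 0.0 for a in actions}
        self.children = {}


def uct_with_partial_policy(mdp, root_state, partial_policy, budget, depth=20, c=1.4):
    """UCT search that only branches over partial_policy(state).

    Assumes `mdp.sample(s, a) -> (s', r)` and `mdp.is_terminal(s)`;
    `partial_policy(s)` returns a non-empty subset of the legal actions at s.
    """
    root = Node(root_state, partial_policy(root_state))

    def rollout(state, d):
        # Default policy: uniform over the pruned action set.
        ret = 0.0
        while d > 0 and not mdp.is_terminal(state):
            state, r = mdp.sample(state, random.choice(partial_policy(state)))
            ret += r
            d -= 1
        return ret

    def simulate(node, d):
        if d == 0 or mdp.is_terminal(node.state):
            return 0.0
        total = sum(node.counts.values()) + 1

        def ucb(a):
            n = node.counts[a]
            if n == 0:
                return float("inf")
            return node.values[a] / n + c * math.sqrt(math.log(total) / n)

        a = max(node.actions, key=ucb)              # UCB1 over allowed actions only
        next_state, r = mdp.sample(node.state, a)
        child = node.children.get((a, next_state))
        if child is None:
            child = Node(next_state, partial_policy(next_state))
            node.children[(a, next_state)] = child
            ret = r + rollout(next_state, d - 1)
        else:
            ret = r + simulate(child, d - 1)
        node.counts[a] += 1
        node.values[a] += ret
        return ret

    for _ in range(budget):
        simulate(root, depth)

    # Recommend the empirically best action among those the partial policy allowed.
    return max(root.actions,
               key=lambda a: root.values[a] / root.counts[a] if root.counts[a] else float("-inf"))
```

With a small allowed set per state, each of the `budget` simulations reaches deeper into the tree for the same cost, which is the anytime-performance benefit the abstract refers to.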