Learning Partial Policies to Speedup MDP Tree Search via Reduction to I.I.D. Learning; Jervis Pinto, Alan Fern
A popular approach for online decision-making in large MDPs is
time-bounded tree search. The effectiveness of tree search,
however, is largely influenced by the action branching factor,
which limits the search depth given a time bound. An obvious way
to reduce action branching is to consider only a subset of
potentially good actions at each state as specified by a
provided partial policy. In this work, we consider offline
learning of such partial policies with the goal of speeding up
search without significantly reducing decision-making quality.
Our first contribution is to study algorithms that reduce the
learning problem to i.i.d. supervised learning. We give a
reduction-style analysis of three such algorithms, each making
different assumptions, which relates the supervised learning
objectives to the sub-optimality of
search using the learned partial policies. Our second
contribution is to describe concrete implementations of the
algorithms within the popular framework of Monte-Carlo tree
search. Finally, the third contribution is to evaluate the
learning algorithms on two challenging MDPs with large action
branching factors. The results show that the learned partial
policies can significantly improve the anytime performance of
Monte-Carlo tree search.
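
To make the pruning idea concrete, here is a minimal Python sketch of UCT-style Monte-Carlo tree search in which each node branches only over the actions proposed by a partial policy. This is an illustration under simplifying assumptions, not the authors' implementation; the `mdp` interface (`actions`, `step`, `is_terminal`) and the `partial_policy` callable are hypothetical stand-ins.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object
    children: dict = field(default_factory=dict)  # action -> Node
    visits: int = 0
    value: float = 0.0

def select_action(node, candidates, c=1.4):
    """UCB1 over the pruned candidate set; try each action once first."""
    untried = [a for a in candidates if a not in node.children]
    if untried:
        return random.choice(untried)
    log_n = math.log(node.visits)
    return max(candidates,
               key=lambda a: node.children[a].value / node.children[a].visits
                             + c * math.sqrt(log_n / node.children[a].visits))

def simulate(node, mdp, partial_policy, depth, gamma=0.95):
    """One UCT simulation that branches only over partial_policy(state)."""
    if depth == 0 or mdp.is_terminal(node.state):
        return 0.0
    # Key idea: search only the actions the learned partial policy
    # proposes, shrinking the effective branching factor; fall back to
    # the full action set if the policy prunes everything away.
    candidates = partial_policy(node.state) or mdp.actions(node.state)
    action = select_action(node, candidates)
    next_state, reward = mdp.step(node.state, action)  # generative model
    child = node.children.get(action)
    if child is None:  # one child per action; adequate for deterministic MDPs
        child = node.children[action] = Node(next_state)
    ret = reward + gamma * simulate(child, mdp, partial_policy, depth - 1, gamma)
    node.visits += 1
    child.visits += 1
    child.value += ret
    return ret
```

With a smaller candidate set at each node, the same simulation budget reaches greater depths, which is the anytime-performance gain the abstract describes; the fallback to the full action set guards against an overly aggressive learned policy.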