Modified: March 22, 2022
monte carlo tree search
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A very natural form of meta-reasoning that selects the most promising computations.
The simplest form of 'expanding' a node assumes a game tree with discrete actions, deterministic dynamics, and fully observed state.
- What if we don't fully observe the state? We can instead work in the belief MDP, using the (action, observation) sequence as a sufficient statistic for beliefs. If the underlying dynamics are deterministic, then the action sequence on its own is sufficient.
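A minimal sketch of the simplest (discrete-action, deterministic, fully observed) case above, assuming hypothetical helpers `legal_actions(state)` and a deterministic `step_fn(state, action) -> (next_state, reward)`:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object                                  # fully observed state
    children: dict = field(default_factory=dict)   # action -> child Node
    visits: int = 0
    total_return: float = 0.0

def expand(node, legal_actions, step_fn):
    """Add one child per untried action (deterministic dynamics assumed)."""
    for action in legal_actions(node.state):
        if action not in node.children:
            next_state, _reward = step_fn(node.state, action)
            node.children[action] = Node(state=next_state)
    return node
```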
Tree policies
The standard tree policy is UCT: at each node, choose the action that maximizes an upper confidence bound on the Q-value:
$$a^* = \arg\max_a \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]$$
where $Q(s, a)$ is the current value estimate, $N(s)$ and $N(s, a)$ are the visit counts for the node and the action, and $c$ is an exploration constant.
My understanding is that in multiplayer games the relevant Q-value is the value for the current player. So each player is trying to win their own game.
This formulation is specifically for games with binary reward, where the Q-values model the win probabilities, and we can (I assume) derive the UCB update from something like a Beta-Bernoulli posterior quantile. In other contexts we'd need to specify an appropriate model family for the reward. In general, exploration requires us to model the distribution of Q-values (distributional RL) rather than just the expected value.
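As a concrete sketch, UCT selection over the children of a node might look like this (reusing the hypothetical `Node` from the expansion sketch above, with `c` as the exploration constant):

```python
import math

def uct_select(node, c=1.4):
    """Pick the child action maximizing Q + c * sqrt(ln N(s) / N(s, a))."""
    def ucb(child):
        if child.visits == 0:
            return float("inf")   # always try unvisited actions first
        q = child.total_return / child.visits
        return q + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.items(), key=lambda kv: ucb(kv[1]))[0]
```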
You could imagine using other bandit algorithms, like Thompson sampling, where we'd sample rewards from the posterior distribution (computed under some model), then expand the action with the largest reward.
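For the binary-reward case above, one concrete (hypothetical) instantiation is a Beta-Bernoulli Thompson sampling rule: keep a Beta posterior over each action's win probability, sample from each posterior, and pick the argmax. A sketch, assuming the same `Node` with 0/1 returns:

```python
import random

def thompson_select(node):
    """Sample each child's win probability from its Beta posterior, pick the argmax."""
    def sampled_win_prob(child):
        wins = child.total_return           # with 0/1 rewards, total_return counts wins
        losses = child.visits - wins
        return random.betavariate(wins + 1, losses + 1)   # uniform Beta(1, 1) prior
    return max(node.children.items(), key=lambda kv: sampled_win_prob(kv[1]))[0]
```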
Good models of Q-values and their uncertainties make MCTS work really well.
You could also imagine objectives other than pure banditry:
- What if we think of expansion as a single bandit problem, rather than a succession of them? Each bandit arm is an entire action sequence (a path to a leaf), so there are exponentially many arms, but at least in principle we can run UCB on this, backing up the Q-values from the leaves to get an estimated payoff for each arm. When would this behave differently from the naive sequential bandits? (A toy sketch of this idea appears after this list.)
Suppose that deep in the tree we find a leaf that samples a reward 100x bigger than anything we've seen before, but the root-level actions have already been averaged over many visits, so this single sample will barely change their estimated expected returns. So a naive sequential bandit won't use this information. But a trajectory bandit in this situation would almost certainly revisit the high-value trajectory it just saw.
- Now why isn't the sequential process good enough in this case? Some possibilities:
- The Q model at the root is misspecified, so it doesn't update to reflect the big potential upside for one of the root actions.
- Posterior updating in an iid model is the wrong way to learn the distribution, because the observed rewards are not iid. Each one is from an expanded version of the previous tree; on average, you'd expect them to go up over time. Given this, we should be modeling the value explicitly, and doing Thompson sampling on those posteriors.
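To make the trajectory-as-arm idea concrete, here's a toy sketch (everything here is hypothetical, and it only scores trajectories that have already been visited; it doesn't address how new trajectories get added):

```python
import math

def trajectory_ucb(trajectory_stats, total_pulls, c=1.4):
    """UCB where each arm is a whole action sequence (a root-to-leaf path).

    trajectory_stats maps tuple(action_sequence) -> (pulls, total_return).
    Returns the action sequence with the highest upper confidence bound.
    """
    def ucb(stats):
        pulls, total_return = stats
        return total_return / pulls + c * math.sqrt(math.log(total_pulls) / pulls)
    return max(trajectory_stats.items(), key=lambda kv: ucb(kv[1]))[0]
```

Under this scoring, the single 100x reward above attaches to one specific trajectory arm, so that arm's upper confidence bound jumps and the next iteration almost certainly re-pulls it, which is exactly the behavior the per-node sequential bandit misses.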
Backpropagation
Review papers:
- Świechowski et al. 2021: https://arxiv.org/abs/2103.04931
- Gelly et al. 2012: The Grand Challenge of Computer Go: Monte Carlo Tree Search and Extensions
DeepMind implementation: https://github.com/deepmind/mctx