monte carlo tree search
Created: March 15, 2022
Modified: March 22, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A very natural form of meta-reasoning that selects the most promising computations (which nodes to expand next).

The simplest form of 'expanding' a node assumes a game tree with discrete actions, deterministic dynamics, and fully observed state.

  • What if we don't fully observe the state? We can instead work in the belief MDP, using the (action, observation) sequence as a sufficient statistic for beliefs. If the underlying dynamics are deterministic, then the action sequence on its own is sufficient.
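
To make the belief-MDP framing concrete, here's a minimal sketch of a search node keyed by its (action, observation) history. The TreeNode class and its field names are my own illustrative choices, not any standard API.

```python
from dataclasses import dataclass, field

# A minimal sketch (names are illustrative, not from any particular library):
# a search node keyed by the (action, observation) history that led to it,
# i.e. a node in the belief-MDP tree. With deterministic, fully observed
# dynamics the observation is redundant and the action sequence alone
# identifies the node.

@dataclass
class TreeNode:
    history: tuple                                 # (action, observation) pairs from the root
    children: dict = field(default_factory=dict)   # (action, observation) -> TreeNode
    visit_count: int = 0
    total_value: float = 0.0

    def expand(self, action, observation):
        """Create (or fetch) the child reached by `action` followed by `observation`."""
        key = (action, observation)
        if key not in self.children:
            self.children[key] = TreeNode(history=self.history + (key,))
        return self.children[key]
```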

Tree policies

The standard tree policy is UCT: at each node, choose the action that maximizes an upper confidence bound on the Q-value:

$$\arg\max_{a\in\mathcal{A}} \; Q(s, a) + C \cdot \sqrt{\frac{\ln N(s)}{N(s, a)}}$$
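
As a rough sketch of the selection step (assuming the TreeNode bookkeeping from the snippet above, with Q(s, a) taken as the running mean total_value / visit_count; the constant c and the infinite score for unvisited actions are conventional choices, not part of the formula itself):

```python
import math

def uct_select(node, c=1.4):
    """Pick the child maximizing the UCB score above.

    Sketch only: assumes each child stores visit_count (N(s, a)) and
    total_value, and that node.visit_count is N(s). Unvisited actions get an
    infinite score, so every action is tried once before the bound kicks in.
    """
    def score(child):
        if child.visit_count == 0:
            return float("inf")
        q = child.total_value / child.visit_count
        return q + c * math.sqrt(math.log(node.visit_count) / child.visit_count)

    # Returns the ((action, observation), child) pair with the highest score.
    return max(node.children.items(), key=lambda kv: score(kv[1]))
```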

My understanding is that in multiplayer games the relevant Q-value is the value for the player to move at the current node, so each player is trying to win their own game.

This formulation is specifically for games with binary reward, where the Qs model the win probabilities, and we can (I assume) derive the UCB update from something like a Beta-Bernoulli posterior quantile. In other contexts we'd need to specify an appropriate model family for the reward. In general, exploration requires us to model the distribution of Q-values (distributional RL) rather than just the expected value.
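
For the binary-reward case, here's one way to turn the posterior-quantile intuition into a concrete score. This is a sketch of the idea, not the bound used in UCT's actual analysis; the uniform Beta(1, 1) prior and the 0.975 quantile are arbitrary choices of mine.

```python
from scipy.stats import beta

def optimistic_score(wins, losses, quantile=0.975):
    """Upper quantile of the posterior over the win probability.

    With binary rewards, Q(s, a) is a win probability, so a Beta-Bernoulli
    posterior is a natural model. Scoring each action by an upper posterior
    quantile gives a UCB-flavored bonus that shrinks as wins + losses grows,
    qualitatively like the sqrt(ln N(s) / N(s, a)) term (though it is not
    the same bound).
    """
    return beta.ppf(quantile, 1 + wins, 1 + losses)

# Two actions with the same empirical win rate (3/4 vs. 30/40): the
# less-visited one gets the larger optimism bonus.
print(optimistic_score(3, 1))    # ≈ 0.95
print(optimistic_score(30, 10))  # ≈ 0.86
```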

You could imagine using other bandit algorithms, like Thompson sampling: sample a Q-value for each action from its posterior (computed under some model), then expand the action with the largest sample.
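
A sketch of what that Thompson-sampling tree policy could look like in the binary-reward setting. The per-child wins/losses fields are assumed bookkeeping on my part, not part of any standard MCTS interface.

```python
import random

def thompson_select(node):
    """Thompson-sampling analogue of the tree policy (binary-reward setting).

    Each child keeps Beta-Bernoulli pseudo-counts (wins, losses); we draw one
    sample of the win probability per action and descend into the argmax.
    Assumes node.children maps actions to children with wins/losses fields.
    """
    def sample(child):
        return random.betavariate(1 + child.wins, 1 + child.losses)

    return max(node.children.items(), key=lambda kv: sample(kv[1]))
```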

MCTS works really well when we have good models of the Q-values and of our uncertainty about them.

You could also imagine objectives other than pure banditry:

  • What if we think of expansion as a single bandit problem, rather than a succession of them? Each bandit arm is an entire action sequence (a root-to-leaf path), so there are exponentially many of them, but at least in principle we can run UCB over these arms, backing up the Q-values from the leaves to get an estimated payoff for each one. When would this behave differently from the naive sequential bandits? Suppose that deep in the tree we find a leaf that samples a reward 100x bigger than anything we've seen before, but that N = 100,000, so this will barely change the expected returns for the root-level actions. A naive sequential bandit won't use this information, but a trajectory bandit in this situation would almost certainly revisit the high-value trajectory it just saw (see the sketch after this list).
    • Now why isn't the sequential process good enough in this case? Some possibilities:
      • The Q model at the root is misspecified, such that it doesn't update to reflect the big potential upside for one of the root actions.
      • Posterior updating in an iid model is the wrong way to learn the Q distribution, because the observed rewards are not iid: each one comes from an expanded version of the previous tree, so on average you'd expect them to go up over time. Given this, we should be modeling the Q^* value explicitly, and doing Thompson sampling on those posteriors.
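
A toy sketch of the "one big bandit over root-to-leaf paths" idea above. The paths, counts, and returns are made-up numbers chosen to mirror the 100x-reward example; enumerating every path explicitly is of course infeasible in a real tree, so this is a thought experiment rather than an algorithm you'd run directly.

```python
import math

def trajectory_ucb(stats, total_plays, c=1.4):
    """Treat each root-to-leaf path as a single bandit arm and run UCB on it.

    `stats` maps each path (a tuple of actions) to (play_count, mean_return).
    A path that just returned a huge value keeps a high score even when the
    root-level averages barely move, because its own count is small; a
    per-node sequential bandit dilutes that signal into the root averages.
    """
    def score(path):
        n, mean = stats[path]
        return mean + c * math.sqrt(math.log(total_plays) / n)

    return max(stats, key=score)

# Hypothetical numbers mirroring the example: 100,000 total simulations, and
# one trajectory that just returned a reward ~100x larger than the rest.
stats = {
    ("a1", "b7", "c3"): (1, 100.0),       # the newly found high-value path
    ("a1", "b2", "c1"): (60_000, 1.0),
    ("a2", "b5", "c9"): (40_000, 0.9),
}
print(trajectory_ucb(stats, total_plays=100_000))  # -> ("a1", "b7", "c3")
```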

Backpropagation

Review papers:

DeepMind implementation: https://github.com/deepmind/mctx