distributional RL
Created: July 22, 2022
Modified: August 04, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There are two forms of uncertainty in value-based reinforcement learning.

Let $r(\tau)$ be the return from trajectory $\tau$, and

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim (s, a, s', \pi(s'), \ldots)}\, r(\tau)$$

be the expected return of a trajectory that begins with $(s, a)$ and follows $\pi$ thereafter. But this is just an expectation. For any given MDP and policy $\pi$ (together defining a 'Markov reward process', i.e., a distribution over trajectories), the return $r(\tau)$ will follow some actual distribution, which will differ across initial states and actions, and of which $Q^\pi$ is only the expectation. That is, we have aleatoric uncertainty over the returns of such trajectories, arising from intrinsic randomness.

In the general reinforcement learning setting, we also have epistemic uncertainty that arises from having only limited knowledge of the MDP we are in. We may fully know the policy $\pi$, but we will be uncertain about its value because we have only limited experience with which to evaluate it. It also seems reasonable to speak of epistemic uncertainty around the optimal policy $\pi^*$ and the corresponding values $Q^*$ (this is in some sense mathematically equivalent to uncertainty about the MDP, since if we knew the MDP then we would also 'know' its optimal policy, up to computational concerns).

Epistemic uncertainty is closely related to exploration, in that representations of uncertainty can tell us where we should gather new experience.

The line of work known as 'distributional RL' is explicitly not concerned with epistemic uncertainty. It is only interested in representing and learning the distribution of returns $r(\tau)$ arising from intrinsic randomness in the MDP.

Theory

The most important theoretical result is that the distributional Bellman backup,

$$\begin{aligned}
\mathbb{P}(Q^\pi(s, a) = z) &\leftarrow \sum_{r, s', a'} \mathbb{P}(s', a', r \mid s, a, \pi)\, \mathbb{P}\left(Q^\pi(s', a') = \frac{z - r}{\gamma}\right)\\
&= \mathbb{E}_{r, s', a' \sim \pi}\, \mathbb{P}\left(Q^\pi(s', a') = \frac{z - r}{\gamma}\right)
\end{aligned}$$

which we abbreviate as $Z \leftarrow \mathcal{T}^\pi Z$ for a set of Q-value distributions $Z(z \mid s, a)$, is a contraction in the maximum Wasserstein metric

$$\bar{d}_p(Z_1, Z_2) \triangleq \sup_{s, a} d_p\left(Z_1(s, a), Z_2(s, a)\right).$$

This implies, by the Banach fixed-point theorem, that iterating the update $Z \leftarrow \mathcal{T}^\pi Z$ converges to a unique fixed point, which must be the return distribution $Z^\pi$. So distributional policy evaluation 'just works'.
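As a concrete illustration, here is a minimal sample-based sketch of one sweep of the evaluation operator $\mathcal{T}^\pi$, representing each $Z(s, a)$ as a bag of return samples. The `sample_transition` and `policy` callables are hypothetical stand-ins for an environment model and policy; this is just one way to instantiate the backup, not how any of the papers below implement it.

```python
import numpy as np

rng = np.random.default_rng(0)

def distributional_backup_samples(Z, sample_transition, policy, states, actions,
                                  gamma=0.9, n_samples=1000):
    """One sweep of a sample-based distributional Bellman backup Z <- T^pi Z.

    Z[s][a] is a 1-D array of return samples representing the current Z(s, a).
    sample_transition(s, a) -> (reward, next_state) draws from the MDP;
    policy(s) -> action. Both are assumed, hypothetical callables.
    """
    Z_new = {}
    for s in states:
        Z_new[s] = {}
        for a in actions:
            samples = np.empty(n_samples)
            for i in range(n_samples):
                r, s_next = sample_transition(s, a)
                a_next = policy(s_next)
                # Bootstrap: draw a return sample from the current estimate at (s', a').
                z_next = rng.choice(Z[s_next][a_next])
                samples[i] = r + gamma * z_next
            Z_new[s][a] = samples
    return Z_new
```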

Unfortunately the situation with control (Q-learning) is not as nice: in general there is no unique optimal return distribution $Z^*$ (because the optimal policy is not necessarily unique); the distributional greedy optimality operator $Z \leftarrow \mathcal{T}^* Z$ (technically there are multiple such operators, corresponding to different ways of breaking ties in the argmax; one can generalize this to allow updates in which the action $a^*$ is chosen stochastically from the set of greedy-optimal actions), given by

$$\begin{aligned}
a^*(s) &= \arg\max_a \mathbb{E}_{z\sim Z(s, a)}[z] \\
Z(z \mid s, a) &\leftarrow \mathbb{E}_{s', r \mid s, a}\, Z\left(\frac{z - r}{\gamma} \bigg| s', a^*(s')\right)
\end{aligned}$$

is not a contraction in any metric; not all versions of this operator have a fixed point; and even the existence of a fixed point is not sufficient to guarantee convergence to it. There are some weaker convergence results: in particular, the expected Q-values still converge, so we are no worse off than with regular Q-learning.
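In the sample-based representation sketched above, the greedy step is just an argmax over sample means; the control backup is then the evaluation sweep with this greedy choice plugged in as the policy (a sketch under the same assumptions as before).

```python
import numpy as np

def greedy_action(Z, s, actions):
    """Greedy action under a sample-based return distribution: maximize the sample mean."""
    return max(actions, key=lambda a: np.mean(Z[s][a]))

# Control backup: reuse the evaluation sweep with the greedy policy plugged in, e.g.
#   Z = distributional_backup_samples(Z, sample_transition,
#                                     lambda s: greedy_action(Z, s, actions),
#                                     states, actions)
```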

Representing value distributions

How should we represent the distribution $Z(z \mid s, a)$ over Q-values?

Quantized categorical distribution (C51, Bellemare et al., 2017): The original distributional RL paper proposed to bound the support of the return distribution to $[-10, 10]$ and use a categorical distribution over 51 evenly spaced atoms within that range. Unfortunately the Bellman backup doesn't preserve this distribution family, so it's necessary to project the backed-up distribution back into the family. The C51 algorithm then trains a network to minimize the KL divergence between its categorical output and this projected target. However, the combined backup-project-minimize-KL update is not a contraction mapping, so we lose all of the theory.
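To illustrate the projection step, here is a minimal NumPy sketch of projecting a backed-up categorical distribution onto a fixed support, assuming the paper's $[-10, 10]$ / 51-atom setup (the constants and function names are mine; a real implementation would batch this and handle terminal transitions):

```python
import numpy as np

# Fixed support of 51 atoms on [-10, 10]; any V_MIN / V_MAX / N_ATOMS would work.
V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)          # z_i
delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

def project_onto_support(reward, gamma, next_probs):
    """Project the backed-up distribution of r + gamma * z onto the fixed atoms.

    next_probs[i] is the probability the bootstrap target assigns to atom z_i.
    Returns a length-N_ATOMS vector of target probabilities on the same support.
    """
    target = np.zeros(N_ATOMS)
    # Each atom z_i moves to Tz_i = r + gamma * z_i, clipped back into [V_MIN, V_MAX].
    tz = np.clip(reward + gamma * atoms, V_MIN, V_MAX)
    # Split each atom's probability mass between its two neighboring support atoms.
    b = (tz - V_MIN) / delta_z                      # fractional index of Tz_i
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    for i in range(N_ATOMS):
        if lower[i] == upper[i]:                    # Tz_i landed exactly on an atom
            target[lower[i]] += next_probs[i]
        else:
            target[lower[i]] += next_probs[i] * (upper[i] - b[i])
            target[upper[i]] += next_probs[i] * (b[i] - lower[i])
    return target
```

The network's categorical output is then trained toward this target by cross-entropy / KL.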

Explicit quantiles (Dabney et al., 2018a): we can 'transpose' C51, so that instead of having the network produce probabilities of evenly spaced point masses, we assume uniform probabilities and instead have the network output a set of locations $\alpha_i$ for the point masses,

$$Z(z \mid s, a) = \frac{1}{N}\sum_{i=1}^N \delta_{\alpha_i}.$$

Given any target distribution with cumulative distribution function $F$, the minimum-Wasserstein projection into this distribution family is to put the $i$th point mass at the corresponding quantile midpoint, $F^{-1}\left(\frac{i - 0.5}{N}\right)$. For example, given $N=2$ point masses, we would put them at the 25th and 75th percentiles of the target distribution, representing the midpoints of the $[0, 0.5]$ and $[0.5, 1.0]$ probability intervals respectively. The composition of this projection with the Bellman backup is a contraction in the Wasserstein metric $\bar{d}_\infty$. Interestingly, it is not directly a contraction in $\bar{d}_p$ for finite $p$, but the bound $\bar{d}_p \le \bar{d}_\infty$ guarantees that we still get convergence in those metrics. In practice, we can train this network from samples using a quantile regression loss.
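A minimal sketch of the quantile regression (pinball) loss against samples from the backed-up target distribution; the function and argument names are mine, and the paper actually uses a Huber-smoothed variant of this loss:

```python
import numpy as np

def quantile_regression_loss(pred_quantiles, target_samples, taus):
    """Pinball loss for predicted quantile locations alpha_i at midpoints taus.

    pred_quantiles: shape (N,), the network's predicted locations alpha_i
    target_samples: shape (M,), samples of r + gamma * z' from the backed-up target
    taus:           shape (N,), quantile midpoints (i - 0.5) / N
    """
    u = target_samples[None, :] - pred_quantiles[:, None]          # (N, M) errors
    # rho_tau(u) = u * (tau - 1{u < 0}): penalize under- and over-estimation asymmetrically.
    return np.where(u >= 0, taus[:, None] * u, (taus[:, None] - 1.0) * u).mean()

# e.g., quantile midpoints for N = 2 point masses: the 25th and 75th percentiles.
taus = (np.arange(1, 3) - 0.5) / 2    # -> array([0.25, 0.75])
```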

Implicit quantiles (Dabney et al., 2018b): instead of explicitly producing a finite set of locations targeting specific quantiles of the return distribution, we feed the desired quantile $\tau$ as an input to the network via a positional embedding, so that the network effectively implements the quantile function $F^{-1}$. Now the return distribution is not limited to a collection of point masses, but can be fully universal! This is trained to minimize an expected quantile loss with respect to the uniform distribution on quantiles, which in practice we approximate by sampling $\tau_{1,\ldots,N} \sim \mathcal{U}(0, 1)$. So the update looks like (see the sketch after this list):

  • sample M values from the target return distribution
  • sample N quantile locations
  • for each quantile location, run a gradient update using the corresponding quantile loss averaged across all M target samples
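A minimal sketch of that update's loss for a single state, assuming a hypothetical `quantile_fn(state, tau)` standing in for the τ-conditioned network (in practice this is a deep network trained by SGD with autodiff, and the loss is Huber-smoothed as in the explicit-quantile case):

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_quantile_loss(quantile_fn, state, target_samples, n_tau=8):
    """Pinball loss at randomly sampled quantile levels.

    quantile_fn(state, tau) -> predicted return at quantile tau (hypothetical network).
    target_samples: shape (M,), samples from the backed-up target return distribution.
    """
    taus = rng.uniform(0.0, 1.0, size=n_tau)                       # tau_1..N ~ U(0, 1)
    preds = np.array([quantile_fn(state, tau) for tau in taus])    # (N,)
    u = target_samples[None, :] - preds[:, None]                   # (N, M) errors
    # Average the per-quantile pinball losses over all M target samples.
    return np.where(u >= 0, taus[:, None] * u, (taus[:, None] - 1.0) * u).mean()
```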

Fully parameterized quantile function (Yang et al., 2019): instead of trying to fit the network at all quantiles via sampling, we additionally parameterize the set of quantiles to target, using a 'fraction proposal network' to propose a specific set of quantile fractions. We are back to representing the return distribution as a set of point masses (now with nonuniform weights in general, since the target quantiles are nonuniformly spaced), but this has the advantage that the network can learn to allocate its representational capacity to useful parts of the distribution.
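A sketch of how the proposed fractions induce a weighted point-mass representation, with `quantile_fn` and `fractions` as hypothetical stand-ins for the quantile network and the fraction proposal network's output:

```python
import numpy as np

def fqf_point_masses(quantile_fn, state, fractions):
    """Turn proposed fractions 0 = tau_0 < ... < tau_N = 1 into a weighted point-mass
    distribution (a sketch; both arguments are assumed, hypothetical inputs)."""
    fractions = np.asarray(fractions)
    weights = fractions[1:] - fractions[:-1]                 # nonuniform atom weights
    midpoints = (fractions[1:] + fractions[:-1]) / 2.0       # quantile midpoints tau-hat_i
    locations = np.array([quantile_fn(state, t) for t in midpoints])
    return locations, weights
```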

References