distributional RL
Created: July 22, 2022
Modified: August 04, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There are two forms of uncertainty in value-based reinforcement learning.

Let $r(\tau)$ be the return from trajectory $\tau$, and

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim (s, a, s', \pi(s'), \ldots)}\, r(\tau)$$

be the expected return of a trajectory that begins with $(s, a)$ and follows $\pi$ thereafter. But this is just an expectation. For any given MDP and policy $\pi$ (together defining a 'Markov reward process', i.e., a distribution over trajectories), the return $r(\tau)$ will follow some actual distribution, which will differ across initial states and actions, and of which $Q^\pi$ is only the expectation. That is, we have aleatoric uncertainty over the returns of such trajectories, arising from intrinsic randomness.

In the general reinforcement learning setting, we also have epistemic uncertainty that arises from having only limited knowledge of the MDP we are in. We may fully know the policy $\pi$, but we will be uncertain about its value because we have only limited experience with which to evaluate it. It also seems reasonable to speak of epistemic uncertainty around the optimal policy $\pi^*$ and the corresponding values $Q^*$ (this is in some sense mathematically equivalent to uncertainty about the MDP, since if we knew the MDP then we would also 'know' its optimal policy, up to computational concerns).

Epistemic uncertainty is closely related to exploration, in that representations of uncertainty can tell us where we should gather new experience.

The line of work known as 'distributional RL' is explicitly not concerned with epistemic uncertainty. It is only interested in representing and learning the distribution of returns $r(\tau)$ arising from intrinsic randomness in the MDP.

Theory

The most important theoretical result is that the distributional Bellman backup,

$$\begin{aligned}
\mathbb{P}(Q^\pi(s, a) = z) &\leftarrow \sum_{r, s', a'} \mathbb{P}(s', a', r \mid s, a, \pi)\, \mathbb{P}\left(Q^\pi(s', a') = \frac{z - r}{\gamma}\right)\\
&= \mathbb{E}_{r, s', a' \sim \pi}\, \mathbb{P}\left(Q^\pi(s', a') = \frac{z - r}{\gamma}\right)
\end{aligned}$$

which we abbreviate as $Z \leftarrow \mathcal{T}^\pi Z$ for a set of Q-value distributions $Z(z \mid s, a)$, is a contraction in the maximum Wasserstein metric

$$\bar{d}_p(Z_1, Z_2) \triangleq \sup_{s, a} d_p\left(Z_1(s, a), Z_2(s, a)\right).$$

This implies, by the Banach fixed-point theorem, that iterating the update $Z \leftarrow \mathcal{T}^\pi Z$ converges to a unique fixed point, which must be the return distribution $Z^\pi$. So distributional policy evaluation 'just works'.
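As a concrete illustration, here is a minimal sample-based sketch of one sweep of the evaluation operator $\mathcal{T}^\pi$, representing each $Z(s, a)$ as a bag of return samples. The `sample_transition` and `policy` callables are hypothetical stand-ins for an environment model and policy; this is just one way to instantiate the backup, not how any of the papers below implement it.

```python
import numpy as np

rng = np.random.default_rng(0)

def distributional_backup_samples(Z, sample_transition, policy, states, actions,
                                  gamma=0.9, n_samples=1000):
    """One sweep of a sample-based distributional Bellman backup Z <- T^pi Z.

    Z[s][a] is a 1-D array of return samples representing the current Z(s, a).
    sample_transition(s, a) -> (reward, next_state) draws from the MDP;
    policy(s) -> action. Both are assumed, hypothetical callables.
    """
    Z_new = {}
    for s in states:
        Z_new[s] = {}
        for a in actions:
            samples = np.empty(n_samples)
            for i in range(n_samples):
                r, s_next = sample_transition(s, a)
                a_next = policy(s_next)
                # Bootstrap: draw a return sample from the current estimate at (s', a').
                z_next = rng.choice(Z[s_next][a_next])
                samples[i] = r + gamma * z_next
            Z_new[s][a] = samples
    return Z_new
```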

Unfortunately the situation with control (Q-learning) is not as nice: in general there is no unique optimal return distribution $Z^*$ (because the optimal policy is not necessarily unique); the distributional greedy optimality operator $Z \leftarrow \mathcal{T}^* Z$ (technically there are multiple such operators, corresponding to different ways of breaking ties in the argmax; one can generalize this to allow updates in which the action $a^*$ is chosen stochastically from the set of greedy-optimal actions), given by

$$\begin{aligned}
a^*(s) &= \arg\max_a \mathbb{E}_{z\sim Z(s, a)}[z] \\
Z(z \mid s, a) &\leftarrow \mathbb{E}_{s', r \mid s, a}\, Z\left(\frac{z - r}{\gamma} \bigg| s', a^*(s')\right)
\end{aligned}$$

is not a contraction in any metric; not all versions of this operator have a fixed point; and even the existence of a fixed point is not sufficient to guarantee convergence to it. There are some weaker convergence results: in particular, the expected Q-values still converge, so we are no worse off than with regular Q-learning.
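In the sample-based representation sketched above, the greedy step is just an argmax over sample means; the control backup is then the evaluation sweep with this greedy choice plugged in as the policy (a sketch under the same assumptions as before).

```python
import numpy as np

def greedy_action(Z, s, actions):
    """Greedy action under a sample-based return distribution: maximize the sample mean."""
    return max(actions, key=lambda a: np.mean(Z[s][a]))

# Control backup: reuse the evaluation sweep with the greedy policy plugged in, e.g.
#   Z = distributional_backup_samples(Z, sample_transition,
#                                     lambda s: greedy_action(Z, s, actions),
#                                     states, actions)
```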

Representing value distributions

How should we represent the distribution $Z(z \mid s, a)$ over Q-values?

Quantized categorical distribution (C51, Bellemare et al., 2017): The original distributional RL paper proposed to bound the support of the return distribution to $[-10, 10]$ and use a categorical distribution over 51 evenly spaced atoms within that range. Unfortunately the Bellman backup doesn't preserve this distribution family, so it's necessary to project the backed-up distribution back into the family. The C51 algorithm then trains a network to minimize the KL divergence between its categorical output and this projected target. However, the combined backup-project-minimize-KL update is not a contraction mapping, so we lose all of the theory.
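To illustrate the projection step, here is a minimal NumPy sketch of projecting a backed-up categorical distribution onto a fixed support, assuming the paper's $[-10, 10]$ / 51-atom setup (the constants and function names are mine; a real implementation would batch this and handle terminal transitions):

```python
import numpy as np

# Fixed support of 51 atoms on [-10, 10]; any V_MIN / V_MAX / N_ATOMS would work.
V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)          # z_i
delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

def project_onto_support(reward, gamma, next_probs):
    """Project the backed-up distribution of r + gamma * z onto the fixed atoms.

    next_probs[i] is the probability the bootstrap target assigns to atom z_i.
    Returns a length-N_ATOMS vector of target probabilities on the same support.
    """
    target = np.zeros(N_ATOMS)
    # Each atom z_i moves to Tz_i = r + gamma * z_i, clipped back into [V_MIN, V_MAX].
    tz = np.clip(reward + gamma * atoms, V_MIN, V_MAX)
    # Split each atom's probability mass between its two neighboring support atoms.
    b = (tz - V_MIN) / delta_z                      # fractional index of Tz_i
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    for i in range(N_ATOMS):
        if lower[i] == upper[i]:                    # Tz_i landed exactly on an atom
            target[lower[i]] += next_probs[i]
        else:
            target[lower[i]] += next_probs[i] * (upper[i] - b[i])
            target[upper[i]] += next_probs[i] * (b[i] - lower[i])
    return target
```

The network's categorical output is then trained toward this target by cross-entropy / KL.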

Explicit quantiles (Dabney et al., 2018a): we can 'transpose' C51, so that instead of having the network produce probabilities of evenly spaced point masses, we assume uniform probabilities and instead have the network output a set of locations $\alpha_i$ for the point masses,

$$Z(z \mid s, a) = \frac{1}{N}\sum_{i=1}^N \delta_{\alpha_i}.$$

Given any target distribution with cumulative distribution function $F$, the minimum-Wasserstein projection into this distribution family is to put the $i$th point mass at the corresponding quantile midpoint, $F^{-1}\left(\frac{i - 0.5}{N}\right)$. For example, given $N=2$ point masses, we would put them at the 25th and 75th percentiles of the target distribution, representing the midpoints of the $[0, 0.5]$ and $[0.5, 1.0]$ probability intervals respectively. The composition of this projection with the Bellman backup is a contraction in the Wasserstein metric $\bar{d}_\infty$. Interestingly, it is not directly a contraction in $\bar{d}_p$ for finite $p$, but the bound $\bar{d}_p \le \bar{d}_\infty$ guarantees that we still get convergence in those metrics. In practice, we can train this network from samples using a quantile regression loss.
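A minimal sketch of the quantile regression (pinball) loss against samples from the backed-up target distribution; the function and argument names are mine, and the paper actually uses a Huber-smoothed variant of this loss:

```python
import numpy as np

def quantile_regression_loss(pred_quantiles, target_samples, taus):
    """Pinball loss for predicted quantile locations alpha_i at midpoints taus.

    pred_quantiles: shape (N,), the network's predicted locations alpha_i
    target_samples: shape (M,), samples of r + gamma * z' from the backed-up target
    taus:           shape (N,), quantile midpoints (i - 0.5) / N
    """
    u = target_samples[None, :] - pred_quantiles[:, None]          # (N, M) errors
    # rho_tau(u) = u * (tau - 1{u < 0}): penalize under- and over-estimation asymmetrically.
    return np.where(u >= 0, taus[:, None] * u, (taus[:, None] - 1.0) * u).mean()

# e.g., quantile midpoints for N = 2 point masses: the 25th and 75th percentiles.
taus = (np.arange(1, 3) - 0.5) / 2    # -> array([0.25, 0.75])
```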

Implicit quantiles (Dabney et al., 2018b): instead of explicitly producing a finite set of locations targeting specific quantiles of the return distribution, we feed the desired quantile $\tau$ as an input to the network via a positional embedding, so that the network effectively implements the quantile function $F^{-1}$. Now the return distribution is not limited to a collection of point masses, but can be fully universal! This is trained to minimize an expected quantile loss with respect to the uniform distribution on quantiles, which in practice we approximate by sampling $\tau_{1,\ldots,N} \sim \mathcal{U}(0, 1)$. So the update looks like (see the sketch after this list):

  • sample M values from the target return distribution
  • sample N quantile locations
  • for each quantile location, run a gradient update using the corresponding quantile loss averaged across all M target samples
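A minimal sketch of that update's loss for a single state, assuming a hypothetical `quantile_fn(state, tau)` standing in for the τ-conditioned network (in practice this is a deep network trained by SGD with autodiff, and the loss is Huber-smoothed as in the explicit-quantile case):

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_quantile_loss(quantile_fn, state, target_samples, n_tau=8):
    """Pinball loss at randomly sampled quantile levels.

    quantile_fn(state, tau) -> predicted return at quantile tau (hypothetical network).
    target_samples: shape (M,), samples from the backed-up target return distribution.
    """
    taus = rng.uniform(0.0, 1.0, size=n_tau)                       # tau_1..N ~ U(0, 1)
    preds = np.array([quantile_fn(state, tau) for tau in taus])    # (N,)
    u = target_samples[None, :] - preds[:, None]                   # (N, M) errors
    # Average the per-quantile pinball losses over all M target samples.
    return np.where(u >= 0, taus[:, None] * u, (taus[:, None] - 1.0) * u).mean()
```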

Fully parameterized quantile function (Yang et al., 2019): instead of trying to fit the network at all quantiles via sampling, we additionally parameterize the set of quantiles to target, using a 'fraction proposal network' to propose a specific set of quantile fractions. We are back to representing the return distribution as a set of point masses (now with nonuniform weights in general, since the target quantiles are nonuniformly spaced), but this has the advantage that the network can learn to allocate its representational capacity to useful parts of the distribution.
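A sketch of how the proposed fractions induce a weighted point-mass representation, with `quantile_fn` and `fractions` as hypothetical stand-ins for the quantile network and the fraction proposal network's output:

```python
import numpy as np

def fqf_point_masses(quantile_fn, state, fractions):
    """Turn proposed fractions 0 = tau_0 < ... < tau_N = 1 into a weighted point-mass
    distribution (a sketch; both arguments are assumed, hypothetical inputs)."""
    fractions = np.asarray(fractions)
    weights = fractions[1:] - fractions[:-1]                 # nonuniform atom weights
    midpoints = (fractions[1:] + fractions[:-1]) / 2.0       # quantile midpoints tau-hat_i
    locations = np.array([quantile_fn(state, t) for t in midpoints])
    return locations, weights
```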

References