Modified: February 22, 2022
deep RL notes
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
Notes from John Schulman's Berkeley course on deep reinforcement learning, Spring 2016.
Value vs Policy-based learning
Value-based methods like Q-learning will learn a deterministic policy: once we have estimates of Q-values, the greedy policy is deterministic. Even if we have uncertainty on Q values, there is still a greedy policy that always maximizes expected future reward (though we could also imagine for example the Thompson-sampling policy, which does not do this).
Policy-based methods can learn stochastic policies. If we train a neural net to output a distribution over actions, that distribution can in general put weight on multiple actions. For MDPs this is irrelevant, because there is always an optimal deterministic policy. But for POMDPs, it can be bad to act deterministically with respect to our observed state, because this ignores the uncertainty in the underlying world (e.g. David Silver's "aliased gridworld" example). And anytime we're doing function approximation, we're really in a POMDP, since our algorithm doesn't get to observe the true state, only a featurized representation. So a policy gradient method can better represent the uncertainty that's inherently introduced by function approximation.
Policy gradient basics
We have an objective function $U(\theta)$, equal to expected discounted reward summed over time. The expectation is with respect to randomness in the transition model, which we don't know, and in our policy $\pi_\theta$, which we do know. Write $\pi_\theta(a \mid s)$ as a distribution over actions parameterized by $\theta$, maybe a neural net or something. Let $\tau$ be a random variable denoting a trajectory. Then
$$U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau)\right] = \sum_\tau P(\tau;\theta)\,R(\tau),$$
where $R(\tau) = \sum_{t=0}^{H-1} \gamma^t r_t$ is the discounted reward over some horizon $H$. We can make $H$ infinite, but it's simpler to start by thinking of it as finite.
We want the gradient $\nabla_\theta U(\theta)$. This will in general be impossible to compute since the expectation involves a sum or integral over all possible trajectories. We can approximate it with a Monte Carlo sample, but first we need to write out more explicitly what exactly this gradient will look like:
$$\nabla_\theta U(\theta) = \nabla_\theta \sum_\tau P(\tau;\theta)\,R(\tau) = \sum_\tau P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau) = \mathbb{E}_\tau\left[\nabla_\theta \log P(\tau;\theta)\,R(\tau)\right].$$
Note the neat trick here: we're taking the derivative of an expectation, with respect to the distribution governing that expectation, and it turns out that this is equivalent to the expectation of a new quantity that involves the log probability (specifically, the score function).
Now we have to consider this interior derivative $\nabla_\theta \log P(\tau;\theta)$. Here another beautiful thing happens. Recall that $P(\tau;\theta)$ is a product of probabilities involving the environment, which determines our states, as well as the policy, which determines our actions. In the log, this becomes a sum. And the terms in the sum involving environmental probabilities vanish when we take the gradient! So we can compute gradients without a model of the environment at all! That is, we end up with
$$\nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$
which involves only the gradients of our policy.
We now have the policy gradient written as an expectation over things we can compute: the reward of a trajectory, times the gradient with respect to the policy of the actions taken during that trajectory. We can approximate this gradient by sampling a set of trajectories $\tau^{(1)}, \ldots, \tau^{(m)}$ from the current policy (and environment), and taking the empirical average:
$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \right) R\!\left(\tau^{(i)}\right).$$
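Here is a minimal sketch of this estimator, assuming a tabular softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta_s)_a$ so that $\nabla_\theta \log \pi_\theta$ has a simple closed form. The trajectory format (lists of (state, action, reward) tuples) and the function names are made up for illustration.

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta, for a tabular softmax policy
    where theta is a float array of shape [n_states, n_actions]."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    grad[s] = -probs          # d/d theta[s, a'] log pi(a | s) = 1{a' = a} - pi(a' | s)
    grad[s, a] += 1.0
    return grad

def policy_gradient_estimate(theta, trajectories, gamma=0.99):
    """Empirical average of grad log P(tau; theta) * R(tau) over sampled trajectories.
    Each trajectory is a list of (state, action, reward) tuples."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        R = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        for s, a, _ in traj:
            grad += grad_log_softmax(theta, s, a) * R
    return grad / len(trajectories)
```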
Variance Reduction
We now have an unbiased estimator of the gradient, but it may have variance too high to be useful. There are a few tricks we can use to reduce the variance.
Eliminating terms
Our expectation really involves two sums over time: the explicit sum over action gradients, but also the sum over rewards that is hidden inside of $R(\tau)$. If we expanded out the explicit product of these sums, we'd have a sum over $H^2$ terms, each one giving the interaction of the policy at time $t$ with the reward at time $t'$. Intuitively, if $t' < t$, the action we take at time $t$ will have no effect on the reward at time $t'$, so we should have $\mathbb{E}_\tau\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\right] = 0$; this allows us to drop all such terms from our sum, maintaining the same expectation while reducing variance. We can prove this formally:
$$\mathbb{E}_\tau\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\right] = \mathbb{E}_{s_t,\, r_{t'}}\left[\, r_{t'}\; \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right] = 0 \qquad \text{for } t' < t,$$
where in the second step we used
$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0.$$
This is a special case of the log-gradient trick we used above: we know that $\nabla_\theta \mathbb{E}_{x \sim p_\theta}\left[f(x)\right] = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\, f(x)\right]$ for any $p_\theta$ and $f$, so choosing $f \equiv 1$ and running this backwards, we get $\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\right] = \nabla_\theta \mathbb{E}_{x \sim p_\theta}[1] = 0$, equivalent to what we just derived.
This trick allows us to write the policy gradient as
$$\nabla_\theta U(\theta) = \mathbb{E}_\tau\left[\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{H-1} \gamma^{t'} r_{t'}\right)\right],$$
which will have lower variance by something like a factor of 4 (we eliminated half the terms, which means we eliminated 3/4 of the interactions between terms).
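In code, the only change from the earlier sketch is the weighting term: each time step's log-prob gradient gets its own reward-to-go rather than the whole-trajectory return. Again a sketch, reusing the hypothetical grad_log_softmax helper and trajectory format from above.

```python
import numpy as np

def policy_gradient_reward_to_go(theta, trajectories, gamma=0.99):
    """Like policy_gradient_estimate, but the gradient at time t is weighted only
    by the discounted rewards from t onward (the t' < t terms have been dropped)."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        discounted = [gamma ** t * r for t, (_, _, r) in enumerate(traj)]
        for t, (s, a, _) in enumerate(traj):
            reward_to_go = sum(discounted[t:])  # sum_{t' >= t} gamma^{t'} r_{t'}
            grad += grad_log_softmax(theta, s, a) * reward_to_go
    return grad / len(trajectories)
```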
Control Variates
Here is a general statistical trick. We are estimating $\mathbb{E}[X]$, and we have some other random variable $Y$ such that $\mathbb{E}[Y] = 0$. Then $\mathbb{E}[X + cY] = \mathbb{E}[X]$ for any $c$. But the new random variables $X + cY$ might have different variance. We can compute the variance explicitly:
$$\operatorname{Var}(X + cY) = \operatorname{Var}(X) + 2c\operatorname{Cov}(X, Y) + c^2 \operatorname{Var}(Y),$$
and try to solve for the $c$ that minimizes the variance (namely $c^\ast = -\operatorname{Cov}(X, Y)/\operatorname{Var}(Y)$); this has the flavor of a least-squares problem. If we have multiple functions $Y_1, \ldots, Y_k$, we can give them all coefficients $c_1, \ldots, c_k$ and again solve for the coefficients that minimize variance.
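A tiny self-contained illustration of the trick on made-up Gaussian toy data: we estimate $\mathbb{E}[X]$ with $X + c^\ast Y$ and watch the variance collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
x = z + 1.0   # E[X] = 1, the quantity we want to estimate
y = z         # strongly correlated with X, and E[Y] = 0

c_star = -np.cov(x, y)[0, 1] / np.var(y)   # variance-minimizing coefficient
plain = x.mean()
controlled = (x + c_star * y).mean()
print(plain, controlled)                    # both are unbiased estimates of 1
print(np.var(x), np.var(x + c_star * y))    # but the controlled variance is far smaller
```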
Applying this to the policy gradient, we can take any function $b(s_t)$ of the state (a baseline), and write
$$\nabla_\theta U(\theta) = \mathbb{E}_\tau\left[\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{H-1} \gamma^{t'} r_{t'} - b(s_t)\right)\right],$$
which works since the added terms are of the form $\mathbb{E}_\tau\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]$, which has expectation zero by the same argument as in the previous section (changing the action at time $t$ won't affect $s_t$, since that's the state we were already in). Then we could imagine solving for the optimal $b$ to reduce the variance. There's a paper that does this (Bartlett and someone?) but the solution might be complicated. In practice, we can think of this as "centering" the reward function, or equivalently subtracting out the effects of past randomness. If we take $b(s_t)$ to be an empirical estimate of the value function at $s_t$ (technically speaking, the finite-horizon value function with horizon $H - t$, assuming we're working with a finite horizon), this works well in practice.
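A sketch of how that baseline version might look, using a crude empirical per-state average of the reward-to-go as $b(s_t)$; it again reuses the hypothetical grad_log_softmax helper and trajectory format from the earlier sketches.

```python
import numpy as np
from collections import defaultdict

def policy_gradient_with_baseline(theta, trajectories, gamma=0.99):
    """Reward-to-go policy gradient with a per-state empirical baseline subtracted."""
    # Empirical baseline: average reward-to-go observed from each state.
    totals, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        discounted = [gamma ** t * r for t, (_, _, r) in enumerate(traj)]
        for t, (s, _, _) in enumerate(traj):
            totals[s] += sum(discounted[t:])
            counts[s] += 1
    baseline = {s: totals[s] / counts[s] for s in totals}

    grad = np.zeros_like(theta)
    for traj in trajectories:
        discounted = [gamma ** t * r for t, (_, _, r) in enumerate(traj)]
        for t, (s, a, _) in enumerate(traj):
            advantage = sum(discounted[t:]) - baseline[s]  # "centered" reward-to-go
            grad += grad_log_softmax(theta, s, a) * advantage
    return grad / len(trajectories)
```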
We can think of this as subtracting out the predictable part of the reward. Suppose we sometimes get lucky and end up in a state $s_t$, maybe call it Heaven, where any policy does well and gives us value $V(s_t)$. Then we shouldn't credit our future actions (after time $t$) with giving us this reward, if the reward was in fact entirely predictable from the fact that we ended up in state $s_t$. As mentioned above, we can think of this as "subtracting out past randomness": if the environment sometimes lands us in Heaven, this is nice, but once we're in Heaven we want to compare new actions on their merits, not say that any action we take is great because we are lucky enough to be taking it in Heaven.
Another way to view this is: suppose we're learning to act in Heaven, where all rewards are very high. Then any trajectory we sample will happen to do well, so we'll reinforce whatever actions occurred in those trajectories, even though those actions could have been entirely random and other trajectories that we didn't sample would have done just as well. It'd be better to only reinforce actions that let us do *better than average* relative to how well we already expect to do just from being in Heaven.
John hinted that there is a dual to this "subtract past randomness" approach, in which we can also subtract future randomness. This could be interesting to work out.
Advantage function formulation
We can write the performance of a policy $\tilde\pi$ as the sum of the performance of some other policy $\pi$ and a sum of terms from the advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, where the advantage function specifies the benefit of performing a specific action $a$ as opposed to just acting under policy $\pi$:
$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde\pi}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right].$$
TODO: derive this. TODO: talk about why this might be useful.
Trust region optimization
We have a surrogate loss $L(\theta)$ whose gradient matches $\nabla_\theta U(\theta)$ at the current parameters $\theta_{\text{old}}$. Instead of just taking a single gradient step, you could imagine trying to optimize $L$, and hoping that this will improve $U$. In general it won't; $L$ can do crazy things once you move away from the current local neighborhood. But if we stay within the local neighborhood, this can actually be a useful thing to try.
The right definition of "local neighborhood" turns out to involve the average KL divergence $\overline{\mathrm{KL}}(\pi_{\theta_{\text{old}}}, \pi_\theta)$ between action distributions under policies $\pi_{\theta_{\text{old}}}$ and $\pi_\theta$, where the "average" here is taken over the states in the sampled trajectory (i.e. from the empirical state distribution). We want to optimize the penalized objective $L(\theta) - \beta\,\overline{\mathrm{KL}}(\pi_{\theta_{\text{old}}}, \pi_\theta)$, or alternately and equivalently, optimize $L(\theta)$ subject to a constraint that $\overline{\mathrm{KL}}(\pi_{\theta_{\text{old}}}, \pi_\theta) \le \delta$.
Apparently this KL objective can be derived by looking at the mean and variance of the quantity implied by the surrogate loss. That is, the surrogate loss is an expectation over an importance-weighted set of advantage functions,
$$L(\theta) = \mathbb{E}_{(s, a) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a)\right]$$
(or something like this, TODO actually derive), and so the quantity inside the expectation is a random variable that has a mean and a variance. When the importance weighting gets very unreliable, the variance will get high. So there's maybe some kind of bias-variance tradeoff. But also this KL objective can be derived by directly bounding the true objective $U(\theta)$ in terms of the surrogate $L(\theta)$ and the KL divergence, apparently. And doing it this way you can guarantee that, for the right choice of $\beta$, the "KL-regularized" objective is a lower bound on the true objective $U(\theta)$. Which means that you get guarantees similar to EM, where optimizing the lower-bound objective is guaranteed to improve your actual objective by at least the same amount.
There's a cool-seeming connection to the natural gradient. If we just did policy gradient, we'd take gradient steps $\theta \leftarrow \theta + \alpha\, \nabla_\theta U(\theta)$. The natural gradient takes steps $\theta \leftarrow \theta + \alpha\, F^{-1} \nabla_\theta U(\theta)$, where $F$ is the Fisher information matrix. But $F$ is (apparently! TODO derive) the Hessian of the KL divergence. So if we further approximate our surrogate loss by approximating $L$ as linear and the KL penalty as quadratic, then the solution to the surrogate optimization problem is actually a natural gradient step! Which is great, because in general the natural gradient is not tractable to compute directly. But also optimizing the actual surrogate function is apparently better in practice than optimizing the linear/quadratic approximation.
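To make the linear/quadratic point concrete: if $g$ is the gradient of the surrogate and $F$ is the Hessian of the KL penalty, the penalized problem $\max_d\; g^\top d - \tfrac{\beta}{2} d^\top F d$ has the closed-form solution $d = \tfrac{1}{\beta} F^{-1} g$, i.e. a natural gradient step. A minimal numpy sketch; the damping term is my addition to keep the solve well-conditioned, not part of the derivation.

```python
import numpy as np

def natural_gradient_step(g, F, beta=1.0, damping=1e-3):
    """Solve the linear/quadratic approximation of the KL-penalized surrogate:
    maximize g.d - (beta/2) d.F.d, whose optimum is d = (1/beta) F^{-1} g."""
    F_damped = F + damping * np.eye(F.shape[0])  # small ridge term for numerical stability
    return np.linalg.solve(F_damped, g) / beta
```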
There's another cool connection to policy iteration. In the case of a tabular policy, optimizing the un-penalized surrogate loss is exactly equivalent to a policy iteration step. This is easy to see since the loss can be written as a sum of advantage functions weighted by probabilities of actions, and the optimum is just to put all of the weight on the actions with the highest advantage functions, which is exactly what the policy iteration update does.
Both the policy iteration connection, and the natural gradient connection, come from Sham Kakade (2001, 2002, probably both are in his thesis?).
Deep Q Learning
tricks:
- Experience replay: keep a library of past trajectories and sample randomly which transitions to use in each training iteration.
- Freezing: do the supervised regression against a frozen version of the Q-value network, not against the current version (which is super unstable). A sketch of both tricks follows below.
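A rough sketch of both tricks; the Q-network interface (a `predict` method) and all the names here are placeholders, not any particular library's API.

```python
import random
import numpy as np

replay_buffer = []  # experience replay: store transitions (s, a, r, s_next, done) as we act

def sample_batch(batch_size=32):
    """Sample a random minibatch of past transitions from the replay buffer."""
    return random.sample(replay_buffer, k=min(batch_size, len(replay_buffer)))

def q_learning_targets(batch, frozen_q, gamma=0.99):
    """Regression targets computed from a *frozen* copy of the Q-network,
    rather than the network currently being trained."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(frozen_q.predict(s_next)))
        targets.append(r + bootstrap)
    return np.array(targets)

# Loop outline: act and append transitions to replay_buffer; every K steps refresh
# frozen_q from the live network; each iteration regress the live network's
# Q(s, a) toward q_learning_targets(sample_batch(), frozen_q).
```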
Q-learning is less inherently stable than policy gradient: there's no guarantee of convergence, and often it diverges. But it also doesn't have the same local-maxima issues. Of course it can get stuck in local maxima, but because it uses the problem structure more globally, they tend not to be quite as stupid.
Supervised Learning
Suppose we have access to the actions of an expert. This can be a human (in an IRL-type setting) or it can be a search algorithm in a simulator. In the latter case, we suppose we have some ability to throw computation at A*, Monte Carlo Tree Search, trajectory optimization, whatever, and get some idea of the optimal action from a particular state. But we'd like to learn a policy that does the right thing without needing all of this computation.
DAgger: train a supervised classifier on the expert's actions.
AggreVaTe: train a supervised regressor on the advantage function (if our expert makes this available).
The difficulty here is that we want the training distribution to match the test distribution, so we want expert labels on the actual states we'll encounter in practice when running our learned policy, NOT the states that the expert would encounter along its optimal policy. So the idea is to start with a bad policy, sample trajectories, get expert actions for each state encountered along those trajectories, train a new policy, and iterate.
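A sketch of that loop; `rollout`, `expert_action`, and `fit_classifier` are placeholders for the environment rollout under the current policy, the expert (human or search algorithm), and whatever supervised learner we use.

```python
def dagger(initial_policy, rollout, expert_action, fit_classifier, n_iters=10):
    """DAgger-style iteration: visit states under our own policy, label them with
    the expert's actions, aggregate the data, and retrain."""
    policy = initial_policy
    dataset = []  # aggregated (state, expert action) pairs across all iterations
    for _ in range(n_iters):
        # Collect states by running OUR current policy, not the expert's...
        states = [s for traj in rollout(policy) for s in traj]
        # ...but label each visited state with what the expert would have done there.
        dataset.extend((s, expert_action(s)) for s in states)
        # Retrain on everything collected so far, then iterate.
        policy = fit_classifier(dataset)
    return policy
```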
Exploration
In the bandit case, UCB and Thompson sampling seem useful. Regret analysis is typical.
For the case of finite-state MDPs, there is a notion of PAC-MDP exploration, in which we are restricted to choosing actions via a computation that runs in polynomial time (here "polynomial" in the size of the MDP, so this includes being able to solve the entire MDP at every step…), and the goal is to get an $\epsilon$-optimal policy after a polynomial number of samples, with probability at least $1 - \delta$, where $\epsilon$-optimal means that the value at every state is within $\epsilon$ of the value of the optimal policy. Algorithms that can do this include Delayed Q-learning, which gives some kind of bonus to Q-values that encounter new states, and RMAX, which adds extra reward to states that haven't been visited much.
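To illustrate the flavor of that optimism (this is not the actual RMAX algorithm, just the idea): treat state-action pairs that haven't been visited enough as if they gave maximal reward, so the agent is drawn toward them. The threshold and bonus values here are arbitrary.

```python
from collections import defaultdict

visit_counts = defaultdict(int)

def optimistic_reward(s, a, observed_reward, r_max=1.0, known_threshold=50):
    """Return an optimistic reward for under-visited state-action pairs,
    and the true observed reward once the pair is 'known'."""
    visit_counts[(s, a)] += 1
    if visit_counts[(s, a)] < known_threshold:
        return r_max  # optimism in the face of uncertainty
    return observed_reward
```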
This general theme of optimism comes up a lot - in UCB, in these algorithms for PAC-MDP learning, and in other algorithms and heuristics.
Difficult cases for exploration are where we might need to move through a ton of unrewarding states to get to the reward. If we're doing epsilon-greedy or something, that means the epsilon-coin needs to come up "explore" a bunch of times in a row, which is exponentially unlikely.
There are also cases that seem objectively easy but are difficult for policy gradient methods. Suppose a state has two actions, one of which provides +1 reward and the other -999 reward. If we start with a policy that's uniform over actions, we'll quickly learn that the state has low expected value and never visit it again. But it's actually a positive-reward state as long as we know which action to take! This represents a lot of real-life situations, where states are good only if you have expertise. Driving a car on the highway is very useful, but until you know how to drive, taking a car onto the highway is actually really bad. So exploration somehow needs to first learn the skill of driving, rather than just heading straight for the highway and immediately getting turned off from the whole endeavor when you crash.
Exploration in Deep RL: there is Thompson sampling, either through dropout on a Q network or from bootstrapping to learn an explicit ensemble of Q networks, where we choose one network from the ensemble to act in each episode.
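A sketch of the ensemble version; `env` here has a made-up minimal reset/step interface (not the real gym API), and each element of `q_heads` is assumed to be a Q-network-like object with a `predict` method.

```python
import random
import numpy as np

def run_episode(env, q_heads):
    """Thompson-sampling-style exploration: sample one Q-network from the
    bootstrapped ensemble and act greedily with it for the whole episode."""
    head = random.choice(q_heads)            # one draw from the "posterior" over Q-functions
    s, done, total = env.reset(), False, 0.0
    while not done:
        a = int(np.argmax(head.predict(s)))  # greedy w.r.t. the sampled head
        s, r, done = env.step(a)
        total += r
    return total
```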
There's also work on continuous-state notions of incentivizing unvisited states, using a distance metric or whatever. (this feels related to a GP on the state space, which can implement Thompson sampling, but has the property that it'll be more confident when near previously visited states).
There's also a lot of work on intrinsic motivation. This is independent of any particular reward function, and says that it might be helpful to try to maximize curiosity-based objectives, such as
- prediction error / surprise of observations (a sketch of this one follows the list)
- information gain about the model
- mutual information between action sequence and future state (empowerment).
These are related ideas to exploration, but different since there's no particular reward function.
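A sketch of the first of these, a prediction-error ("surprise") intrinsic reward: train a forward model of the dynamics and use its error as a bonus. `forward_model` is a placeholder for any regressor with `predict` and `update` methods; the scaling is arbitrary.

```python
import numpy as np

def intrinsic_reward(forward_model, s, a, s_next, scale=1.0):
    """Curiosity bonus: how badly did our learned dynamics model predict s_next?"""
    predicted = forward_model.predict(s, a)
    error = float(np.mean((np.asarray(predicted) - np.asarray(s_next)) ** 2))  # surprise
    forward_model.update(s, a, s_next)  # keep learning the dynamics as we go
    return scale * error
```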
Note that there always is an optimal exploration policy. That is, we can always consider the POMDP whose state is a belief distribution over MDPs. The optimal policy in this POMDP gives the optimal exploration policy.
Dylan points out that exploration is kind of the opposite of safety. A safe system should stay close to the set of previously explored states and not drastically change its behavior, while an exploratory system does the opposite. One interpretation of this is that failure to explore is a kind of regularization. I personally am not quite convinced by this: it seems like safety is a more fine-grained property than exploration, and it makes sense to treat them separately. But there is a connection.
Another thought is that exploration is hard in the same sense that global optimization is hard. In both cases the problem is mostly about avoiding local minima. Exploration is explicitly the task of not doing the immediately greedy thing. For this reason it seems that optimal exploration must be NP-hard or #P-hard or whatever, for the same reason that general nonconvex optimization is hard. You can always define an MDP with an exponential number of states, only one of which gives any reward, and the only way to learn that is to visit them all.