deep deterministic policy gradient
Created: July 22, 2022
Modified: July 27, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Deep deterministic policy gradient (DDPG) is an interesting RL algorithm with a somewhat misleading name. Although its name indicates that it's a policy gradient method, it's 'really' more like an adaptation of deep Q-learning to continuous action spaces.

To evaluate and optimize the policy gradient objective

J(\theta) = \mathbb{E}_{s\sim\mathcal{D}_\pi,\, a\sim \pi_\theta(\cdot|s)} \left[Q^{\pi_\theta}(s, a)\right]

we have to somehow estimate the values Q^{\pi_\theta}(s, a). Broadly, there are two approaches: we can use real-world experience, or we can use a model.

On-policy ('experience') methods

Classically, policy-gradient methods estimate these values from real-world experience, using either

  1. the unbiased Monte Carlo estimator, that is, the sum of discounted rewards actually received in an on-policy trajectory from (s, a), or
  2. a temporal difference estimator, in which we look at only some finite number of transitions (or some mixture of finite numbers of transitions) following (s, a) and replace the remaining terms with an estimate V(s_k) of the value of the state we've reached at that point. (Both estimators are sketched below.)
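
As a concrete sketch (mine, not from any particular implementation), the two estimators might look like this for a single recorded trajectory, where `value_fn` stands in for a learned state-value estimate V(s):

```python
# Monte Carlo vs. k-step TD value estimates for the first (s, a) pair of a
# recorded trajectory: rewards = [r_0, ..., r_{T-1}], states = [s_0, ..., s_T].

def monte_carlo_return(rewards, discount=0.99):
    """Unbiased estimate of Q(s_0, a_0): the full discounted return."""
    return sum(discount**t * r for t, r in enumerate(rewards))

def k_step_td_estimate(rewards, states, value_fn, k, discount=0.99):
    """Biased, lower-variance estimate: k real rewards, then bootstrap the
    remaining terms with the learned value estimate V(s_k)."""
    if k >= len(rewards):  # episode ended within k steps; nothing to bootstrap
        return monte_carlo_return(rewards, discount)
    partial = sum(discount**t * r for t, r in enumerate(rewards[:k]))
    return partial + discount**k * value_fn(states[k])
```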

These rely on sampling transitions from the environment, giving rise to the usual characteristics of policy-gradient methods:

  1. They are 'on-policy' - each evaluation (and thus each update) requires new experience collected from the environment, which is expensive.
  2. We must use the score function as a gradient estimator, so gradient variance is a huge concern. We'll likely train a value function baseline as a 'critic' to reduce variance (sketched after this list).
  3. They work equally well with continuous and discrete action spaces - all that matters is that the action fits the 'type signature' expected by the environment.
  4. As stochastic gradient descent methods, they are stable and will converge to a local optimum (as long as we have unbiased estimates of the gradient, which we can approach by using a long enough TD horizon).
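
To make point 2 concrete, here is a minimal sketch (my own, assuming JAX and a hypothetical diagonal-Gaussian policy `policy_log_prob`) of the score-function gradient estimate with a baseline subtracted; `returns` would come from one of the estimators above and `baselines` from a learned critic:

```python
import jax
import jax.numpy as jnp

def policy_log_prob(theta, state, action):
    # Hypothetical Gaussian policy: mean = W @ state, fixed unit variance
    # (log-probability up to an additive constant).
    mean = theta["W"] @ state
    return -0.5 * jnp.sum((action - mean) ** 2)

def score_function_gradient(theta, states, actions, returns, baselines):
    """grad_theta J ~= mean over samples of grad log pi(a|s) * (return - baseline)."""
    def surrogate(theta):
        logps = jax.vmap(lambda s, a: policy_log_prob(theta, s, a))(states, actions)
        # The advantage weights are data, not a function of theta.
        advantages = jax.lax.stop_gradient(returns - baselines)
        return jnp.mean(logps * advantages)
    return jax.grad(surrogate)(theta)
```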

Off-policy ('surrogate') methods

We can avoid the need for real-world experience directly in the policy gradient update by instead using a model Q_\phi(s, a) that we've trained to approximate the state-action values Q^{\pi_\theta}(s, a). This is almost like Bayesian optimization: since evaluating the real-world objective (the value of a policy) and estimating its gradient is expensive, we learn a surrogate objective instead, and then optimize that.
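
For concreteness, a hedged sketch of how the surrogate itself might be fit: regress Q_\phi onto bootstrapped Bellman targets computed from replayed off-policy transitions. Here `q_network` and `policy` are hypothetical function approximators, and terminal-state handling is omitted:

```python
import jax
import jax.numpy as jnp

def q_loss(q_params, q_target_params, policy_params,
           states, actions, rewards, next_states,
           q_network, policy, discount=0.99):
    """Squared error between Q_phi(s, a) and the Bellman target
    r + discount * Q_target(s', pi(s')), over a batch of replayed transitions."""
    next_actions = jax.vmap(lambda s: policy(policy_params, s))(next_states)
    bootstrap = jax.vmap(
        lambda s, a: q_network(q_target_params, s, a))(next_states, next_actions)
    targets = jax.lax.stop_gradient(rewards + discount * bootstrap)
    preds = jax.vmap(lambda s, a: q_network(q_params, s, a))(states, actions)
    return jnp.mean((preds - targets) ** 2)
```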

A policy update that doesn't require real-world experience has different characteristics:

  1. It is, by definition, 'off-policy' - we're following the gradient of a surrogate objective, an artifact trained to distill real-world experience, which exists and is valid independently of any particular policy we use to query it.
  2. It can use backprop for proper gradients if the action space is continuous (so that the reparameterization trick applies) or discrete and small (so that we can take the expectation exactly).
  3. The policies it finds will only ever be as good as its proxy objective. If some policy would work great in the real world but the proxy objective somehow fails to represent this (e.g. because it's represented by something like a deep network that has limited capacity), that policy will never be found.
  4. Since the proxy objective itself depends on experience, which we must gather somehow (presumably using a current policy), we have to worry about choosing policies that will be good exploratory policies. If you just choose the expected-best policies with no regard for exploration, your surrogate will end up ignorant of large parts of the space, your optimizer won't know this, and bad things will happen.

Put differently, the policy in this setting functions simply as a distillation, or an amortized optimization, of the Q function (the prototypical 'surrogate objective'). The actual learning - the interaction with the real world - takes place through some Q-learning process, and the policy exists only as a computational convenience, saving us from running an optimizer on the Q function every time we want to act.

DDPG is just the simplest possible instantiation of this framework, where we use a Q estimator as the surrogate objective and consider the class of deterministic policies.
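
Concretely, the DDPG-style actor update just pushes a deterministic policy mu_theta uphill on the learned surrogate. A sketch under the same assumptions as the earlier snippets (hypothetical `mu` and `q_network`):

```python
import jax
import jax.numpy as jnp

def actor_loss(policy_params, q_params, states, mu, q_network):
    """Negative mean of Q_phi(s, mu_theta(s)): gradients flow through the
    deterministic action into the policy parameters, with Q_phi held fixed."""
    actions = jax.vmap(lambda s: mu(policy_params, s))(states)
    q_values = jax.vmap(lambda s, a: q_network(q_params, s, a))(states, actions)
    return -jnp.mean(q_values)

# e.g. actor_grads = jax.grad(actor_loss)(policy_params, q_params, states, mu, q_net)
```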

The best of both worlds?

Suppose we have a surrogate value function Q_\phi learned from off-policy experience, but ultimately we want to maximize the real value function Q^\pi. We can combine approaches by using the surrogate as an action-dependent control variate for an unbiased policy gradient update. That is, to optimize

J(\theta) = \mathbb{E}_{s\sim \mathcal{D}, a\sim\pi_\theta} \left[Q^{\pi_\theta}(s, a)\right]

we rewrite the objective as

J(\theta) = \mathbb{E}_{s\sim \mathcal{D}, a\sim\pi_\theta} \left[Q^{\pi_\theta}(s, a) - Q_\phi(s, a) + Q_\phi(s, a)\right]

and then apply the score-function gradient estimator to the first two terms (the residual Q^{\pi_\theta} - Q_\phi) and the reparameterization trick (assuming a continuous action) to the remaining Q_\phi term,

\begin{align*}
\nabla_\theta J(\theta) &= \mathbb{E}_{s\sim \mathcal{D}, a\sim\pi_\theta} \left[\left(\nabla_\theta \log \pi_\theta(a | s)\right)\left(Q^{\pi_\theta}(s, a) - Q_\phi(s, a)\right)\right]\\
&\qquad+\mathbb{E}_{s\sim \mathcal{D}, \epsilon} \left[\nabla_\theta Q_\phi(s, f_\theta(\epsilon; s))\right]
\end{align*}

where a = f_\theta(\epsilon; s) reparameterizes the sampled action.
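
A sketch of this combined estimator, under the same assumptions as the earlier snippets (hypothetical `policy_log_prob`, `q_network`, and a reparameterized sampler `sample_action(theta, s, eps)`; `q_hat` is an unbiased on-policy estimate of Q^pi(s, a), and `actions` are the actions actually produced from `noise`):

```python
import jax
import jax.numpy as jnp

def combined_gradient(theta, phi, states, actions, noise, q_hat,
                      policy_log_prob, sample_action, q_network):
    # Score-function term on the residual Q^pi - Q_phi (the control variate).
    def score_term(theta):
        logps = jax.vmap(lambda s, a: policy_log_prob(theta, s, a))(states, actions)
        residual = jax.lax.stop_gradient(
            q_hat - jax.vmap(lambda s, a: q_network(phi, s, a))(states, actions))
        return jnp.mean(logps * residual)

    # Reparameterized term: differentiate Q_phi(s, f_theta(eps; s)) through the action.
    def reparam_term(theta):
        acts = jax.vmap(lambda s, e: sample_action(theta, s, e))(states, noise)
        qs = jax.vmap(lambda s, a: q_network(phi, s, a))(states, acts)
        return jnp.mean(qs)

    g1 = jax.grad(score_term)(theta)
    g2 = jax.grad(reparam_term)(theta)
    return jax.tree_util.tree_map(jnp.add, g1, g2)
```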

This is roughly what Q-prop does, except that they replace Q_\phi with a Taylor approximation of Q_\phi around the expected action \mu_\theta(s) = \mathbb{E}\, f_\theta(\epsilon; s), for some reason I don't fully understand. Possibly this reduces the variance further?

Connections to Bayesian optimization

writing inbox: flesh this out. what are the connections between DDPG and Bayesian optimization?? can we derive DDPG from a Bayesopt approach and does this suggest any generalizations?

a difference is that in bayesopt we usually learn a proxy objective directly as a function of the parameters \theta that we're optimizing. but here the parameters are of a policy, while the proxy objective is in action-space.

a proper bayesopt approach would maintain epistemic uncertainty over the objective values. that's like having distributional RL maintaining uncertainty over the Q values!

what is the appropriate analogue of an acquisition function? it would choose where to gather real-world experience, so in effect it would be an exploration policy. but it should depend on our uncertainty in the Q values! we would then 'act' (choose the next point in policy-space with which to gather a trajectory) at whatever point maximizes some acquisition-style surrogate objective - expected improvement, an upper confidence bound, a Thompson sample, whatever. so these must fall out as natural techniques for exploration in distributional RL! and in fact from this perspective they are the natural ways to think about exploration in general!

another thing that falls out of this view: something like PPO, which optimizes a surrogate objective to try to wring a bit more value out of its last round of experience, is doing a weak version of what we could do with a proper surrogate objective learned by a distributional Q-function!