maximum-entropy reinforcement learning
Created: April 22, 2022
Modified: July 28, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

For any reward function $r(s, a, s')$ and policy $\pi$, consider the entropy-regularized reward

$$r_\beta^\pi(s, a, s') = r(s, a, s') + \beta H(\pi(\cdot | s)).$$

Taking as our objective the (expected, discounted) regularized reward $r_\beta^\pi$ yields maximum-entropy reinforcement learning.

Note that the entropy penalty subtly changes the structure of the RL problem: rather than a fixed, objective reward defined independently of the agent, we now have a different reward function for each policy; the reward 'shifts under our feet' in a sense. But this turns out not to matter very much once we move up to the regularized value functions $V_\beta^\pi$ and $Q_\beta^\pi$: since value functions were always policy-dependent, allowing this dependence into the reward function creates no special difficulties.

Basics

Concretely, the $\pi$-regularized value of being in the initial state $s_0$ is

\begin{align*}
V_\beta^{\pi}(s_0) &= \mathbb{E}\left[\sum_{t=0}^{T}\gamma^t \left(r(s_t, a_t, s_{t+1}) + \beta H(\pi(\cdot | s_{t}))\right)\right]\\
&= V^{\pi}(s_0) + \beta\,\mathbb{E}\left[\sum_{t=0}^T\gamma^t H(\pi(\cdot | s_{t}))\right],
\end{align*}

equal to the unregularized value $V^\pi$, plus the expected discounted sum of policy entropies over the rest of the trajectory.
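As a sanity check, here is a minimal Monte Carlo sketch of the quantity inside the expectation (my own code; `rewards` and `entropies` are hypothetical per-step logs from a rollout):

```python
def entropy_regularized_return(rewards, entropies, gamma, beta):
    """Single-trajectory Monte Carlo estimate of V_beta^pi(s_0).

    rewards[t] is r(s_t, a_t, s_{t+1}) and entropies[t] is H(pi(.|s_t)),
    both assumed to have been logged while rolling out the policy.
    """
    ret = 0.0
    # Accumulate backwards so each step is discounted by gamma relative to the next.
    for r, h in zip(reversed(rewards), reversed(entropies)):
        ret = (r + beta * h) + gamma * ret
    return ret
```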

We define the $\pi$-regularized action value as

$$Q^{\pi}_\beta(s, a) = \mathbb{E}_{s' \mid s, a}\left[r(s, a, s') + \gamma V_\beta^\pi(s')\right];$$

so that $V$ and $Q$ are related by

$$V^\pi_\beta(s) = \mathbb{E}_{a\sim \pi}\left[Q_\beta^\pi(s, a)\right] + \beta H(\pi(\cdot | s)).$$

Note that this is a slight departure from the naïve derivation where we just substitute $r_\beta^\pi$ for $r$ everywhere. This would fold the current-step entropy term $H(\pi(\cdot | s))$ into $Q^\pi_\beta(s, a)$, whereas we've pulled it out as a separate term in $V^\pi_\beta(s)$.
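To make the two conventions explicit, the naïve substitution would define (my notation for this variant)

\begin{align*}
\tilde{Q}^\pi_\beta(s, a) &= \mathbb{E}_{s'|s,a}\left[r_\beta^\pi(s, a, s') + \gamma V_\beta^\pi(s')\right] = Q^\pi_\beta(s, a) + \beta H(\pi(\cdot | s)),\\
V^\pi_\beta(s) &= \mathbb{E}_{a\sim\pi}\left[\tilde{Q}^\pi_\beta(s, a)\right],
\end{align*}

so both conventions describe the same value $V^\pi_\beta$; they differ only in where the current-step entropy bonus is booked.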

One reason to do this is that it highlights that we can incorporate the gradient of the current-step entropy exactly in a policy gradient. Suppose we want to optimize the objective

\begin{align*}
J(\theta) &= \mathbb{E}_{s_0}\left[V^{\pi_\theta}_{\beta}(s_0)\right]\\
&= \mathbb{E}_{s_0,\, a_0\sim \pi_\theta} \left[Q^{\pi_\theta}_{\beta}(s_0, a_0) - \beta \log \pi_\theta(a_0 | s_0)\right]
\end{align*}

with respect to policy parameters $\theta$. Then we can compute the policy gradient $\nabla_\theta J(\theta)$ using the exact gradient of the entropy term, or a sampled gradient using the reparameterization trick, or just the score function. To complete the policy gradient we need an estimate of $Q^{\pi_\theta}_\beta$, either through a Monte Carlo trajectory or some version of TD error (in either case we can apply the score function trick as usual), or through a model $Q_{\phi,\beta}$ which we can directly differentiate if we have continuous actions.
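For concreteness, here is a minimal sketch of the "exact entropy gradient plus score function" variant for discrete actions (PyTorch; all names are my own, and the $Q$ estimates are treated as given constants):

```python
import torch
import torch.nn.functional as F

def maxent_pg_loss(logits, actions, q_estimates, beta):
    """Surrogate loss whose gradient is a maxent policy-gradient estimate.

    logits: [B, A] policy logits at sampled states (differentiable in theta).
    actions: [B] long tensor of sampled action indices.
    q_estimates: [B] estimates of Q_beta^pi(s, a), e.g. Monte Carlo returns
    or a critic's output, treated as constants here.
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # [B, A]
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)                     # exact H(pi(.|s)), differentiable
    logp_a = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    score_term = logp_a * q_estimates.detach()                     # score-function surrogate for E_a[Q]
    return -(score_term + beta * entropy).mean()                   # minimize the negative objective
```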

Q: what if we're using an $n$-step temporal difference estimate of $Q$ values? Then we do also know the analytic gradients of entropies at the next $n$ time steps and should presumably use them.

Treating the expressions above as Bellman backup operators, we can iteratively apply them to perform soft policy evaluation.
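For a tabular MDP this is a few lines; a minimal sketch (my own array conventions, not from any reference):

```python
import numpy as np

def soft_policy_evaluation(P, R, pi, gamma, beta, n_iters=1000):
    """Iterate the soft Bellman backups for a fixed policy in a tabular MDP.

    P[s, a, s'] are transition probabilities, R[s, a, s'] rewards,
    pi[s, a] the policy. Returns V_beta^pi and Q_beta^pi.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=-1)   # Q_beta(s,a) = E_s'[r + gamma V(s')]
        entropy = -(pi * np.log(pi + 1e-12)).sum(axis=-1)       # H(pi(.|s))
        V = (pi * Q).sum(axis=-1) + beta * entropy              # V_beta(s) = E_a[Q] + beta H
    return V, Q
```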

Soft Q-learning

Reference: Reinforcement Learning with Deep Energy-Based Policies

The maxent RL problem has an optimal policy, which we can in principle find (up to local maxima) through policy gradient methods. But what if we want to do value-based learning, without representing the policy explicitly? In particular, suppose we are given action values $Q_{\theta,\beta}(s, \cdot)$ at a particular state. If the actions are discrete, then the optimal policy at that state is given by

$$\pi_\theta^*(\cdot | s) = \text{argmax}_{\|p\|_1=1} \sum_{a=1}^{|\mathcal{A}|} p_a\left[Q_{\theta, \beta}(s, a) - \beta \log p_a\right]$$

Setting the gradient inside the argmax to zero (with a Lagrange multiplier for the normalization constraint) and solving for $p_a$, we find that the optimal policy is a softmax over scaled $Q$-values,

$$\pi^*_\theta(a | s) = \frac{1}{Z(s)}\exp\left(\frac{1}{\beta}Q_{\theta, \beta}(s, a)\right)$$

which, we are satisfied to notice, approaches a point mass at the highest-value action as $\beta\to 0$, and approaches the uniform distribution as $\beta\to\infty$. Plugging this back into the objective, with a bit of algebra we see that this policy achieves a regularized value equal to the log-sum-exp of the $Q$ values

\begin{align*}
V_\beta(s) &= \mathbb{E}_{a\sim \pi^*}\left[Q_{\theta,\beta}(s, a) - \beta \log \pi^*(a|s)\right]\\
&= \sum_{a=1}^{|\mathcal{A}|}\pi^*(a|s)\left[Q_{\theta,\beta}(s, a) - \beta\log e^{\frac{1}{\beta}Q_{\theta,\beta}(s, a)} + \beta \log Z(s)\right]\\
&= \beta \log Z(s)\\
&= \beta \log \sum_{a=1}^{|\mathcal{A}|}\exp\left(\frac{1}{\beta}Q_{\theta,\beta}(s, a)\right)
\end{align*}

which (also satisfyingly) approaches $\max_a Q(s, a)$ in the $\beta\to 0$ setting where the distribution concentrates at the highest-value action.
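A quick numerical sketch of both limits (my own code; assumes SciPy is available):

```python
import numpy as np
from scipy.special import logsumexp, softmax

def soft_policy_and_value(q_values, beta):
    """Soft-optimal policy and soft value at one state, from Q(s, .)."""
    pi = softmax(q_values / beta)            # pi*(a|s) proportional to exp(Q(s,a)/beta)
    v = beta * logsumexp(q_values / beta)    # V_beta(s) = beta log sum_a exp(Q(s,a)/beta)
    return pi, v

q = np.array([1.0, 2.0, 0.5])
for beta in [1.0, 0.1, 0.01]:
    pi, v = soft_policy_and_value(q, beta)
    print(beta, np.round(pi, 3), np.round(v, 3))   # pi -> argmax, v -> max(q) = 2.0 as beta -> 0
```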

From this value we can derive a temporal difference error

$$\delta_t = \left[r_{t+1} + \gamma \beta \log \sum_{a'=1}^{|\mathcal{A}|}\exp\left(\frac{1}{\beta}Q_{\theta,\beta}(s_{t+1}, a')\right)\right] - Q_{\theta,\beta}(s_t, a_t)$$

and related quantities. Then we can do Q-learning by minimizing this TD error, e.g., through gradient descent on $\theta$.
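Concretely, the TD error in the tabular case might look like this (a sketch with my own naming):

```python
from scipy.special import logsumexp

def soft_td_error(q_table, s, a, r, s_next, gamma, beta):
    """Soft Q-learning TD error for one transition, with a tabular q_table[s, a].

    With function approximation, the same target would instead be regressed
    onto by gradient steps on theta.
    """
    soft_v_next = beta * logsumexp(q_table[s_next] / beta)   # beta log sum_a' exp(Q(s',a')/beta)
    return (r + gamma * soft_v_next) - q_table[s, a]
```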

Is double Q-learning meaningful in this setting? In the unregularized deterministic setting, we would evaluate the optimal policy (max Q) using one network to choose the best action (argmax Q) and the other to evaluate that action. Here I guess we do the same thing? We use one network $Q_1$ to define the softmax distribution over actions, with which we take the expectation, and another network $Q_2$ to evaluate them? Let $\pi^*_1$ and $\pi^*_2$ denote the softmax policies corresponding to $Q_1$ and $Q_2$ respectively. Then we have

\begin{align*}
V_\beta(s) &= \mathbb{E}_{a\sim \pi_1^*}\left[Q_2(s, a) - \beta \log \pi_1^*(a|s)\right]\\
&= \beta\log Z_1(s) + \mathbb{E}_{a\sim \pi_1^*}\left[Q_2(s, a) - Q_1(s, a)\right]
\end{align*}

which corrects the naive estimate. We can check that this recovers the usual double Q estimate in the limit as $\beta \to 0$. The first (softmax) term becomes a hard max under $Q_1$; letting $a^*_1$ denote the optimal action (argmax) under $Q_1$, it evaluates to $Q_1(s, a^*_1)$. The expectation in the second term concentrates at $a_1^*$, so we simply evaluate the inner quantity $Q_2(s, a_1^*) - Q_1(s, a_1^*)$, and applying this correction to the first term we recover $Q_2(s, a_1^*)$ as we'd hope!
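A sketch of this estimator (my own code, not from the paper):

```python
import numpy as np
from scipy.special import logsumexp, softmax

def double_soft_value(q1, q2, beta):
    """Double-estimator soft value at one state: beta*log Z_1 + E_{pi_1}[Q_2 - Q_1].

    q1, q2: [A] arrays of action values from the two networks.
    As beta -> 0 this approaches Q_2(s, argmax_a Q_1(s, a)).
    """
    pi1 = softmax(q1 / beta)           # softmax policy under Q_1
    log_z1 = logsumexp(q1 / beta)      # log partition function of that policy
    return beta * log_z1 + np.dot(pi1, q2 - q1)
```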

Soft Actor-Critic

Given any policy $\pi$, we can apply the Bellman operators above to learn soft value functions $Q^{\pi}_\beta$ and $V^\pi_\beta$, which we parameterize using function approximators $Q_{\phi,\beta}$ and $V_{\phi,\beta}$. Following the above, the optimal policy with respect to these is simply the softmax policy on $Q$-values. Equivalently it is the policy that minimizes

\begin{align*}
J(\theta) &= \mathbb{E}_{s\sim d_\pi} \left[D_\text{KL}\left(\pi_\theta(\cdot | s) \,\middle\|\, \frac{\exp\left(\frac{1}{\beta}Q(s, \cdot)\right)}{Z(s)}\right)\right]\\
&= -\frac{1}{\beta}\,\mathbb{E}_{s\sim d_\pi,\, a\sim \pi_\theta(\cdot|s)} \left[Q(s, a) - \beta \log \pi_\theta(a|s)\right] + \text{const},
\end{align*}

so we can train a parameterized policy $\pi_\theta$ using gradients of this objective. If the actions are continuous, we can use reparameterization gradients (since $Q$ here is a differentiable function approximator); this is essentially just deep deterministic policy gradients (DDPG) with an entropy term, since the reparameterization trick effectively gives us a deterministic policy as a function of exogenous noise $\epsilon$. For small discrete action spaces, we can also do the expectation analytically (and otherwise use the score function trick or other gradient estimator of our choice).
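A minimal sketch of the reparameterized update in the continuous-action case (PyTorch; `policy_net` and `q_net` are hypothetical modules, and I'm skipping the tanh squashing and its log-det correction used in the SAC paper):

```python
import torch

def sac_policy_loss(policy_net, q_net, states, beta):
    """Reparameterized policy loss, minimizing E[beta * log pi - Q].

    policy_net(states) returns the mean and log-std of a diagonal Gaussian
    over actions; q_net(states, actions) returns a [batch] tensor of
    Q_{phi,beta}(s, a) values.
    """
    mean, log_std = policy_net(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    actions = dist.rsample()                        # reparameterized: a = mu + sigma * eps
    log_pi = dist.log_prob(actions).sum(dim=-1)     # log pi_theta(a|s), summed over action dims
    q = q_net(states, actions)                      # differentiable path through the critic
    return (beta * log_pi - q).mean()
```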

Exploration

As probabilistic inference

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (Levine, 2018)

A straightforward mapping between RL and probabilistic inference is that, for a given observed state trajectory $\mathbf{x} = s_0, \ldots, s_T$:

  • the actions $\mathbf{z} = a_0, \ldots, a_{T-1}$ are the unknown latent variables
  • the discounted advantages $\sum_t \gamma^t A(s_t, a_t)$ or, equivalently, Q-values $\sum_t \gamma^t Q(s_t, a_t)$ define the joint log-density $\log p(\mathbf{x}, \mathbf{z})$
  • the optimal policy $\pi^*(a_t|s_t)$ is the posterior $p(\mathbf{z}|\mathbf{x})$.

Concretely, if we identify a parameterized policy $\pi_\theta$ with an amortized surrogate posterior $q_\theta(z|x)$ in variational inference, then the ELBO

$$\mathbb{E}_{x\sim \mathcal{D}_X}\,\mathbb{E}_{z\sim q_\theta(\cdot | x)} \left[\log p(x, z) - \log q_\theta(z|x)\right]$$

becomes the maxent RL objective (here with $\beta = 1$; a general temperature corresponds to scaling $Q$ by $1/\beta$)

$$\mathbb{E}_{s\sim \mathcal{D}_S}\,\mathbb{E}_{a\sim \pi_\theta(\cdot | s)} \left[Q(s, a) - \log \pi_\theta(a|s)\right]$$
