maximum-entropy reinforcement learning
Created: April 22, 2022
Modified: July 28, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

For any reward function $r(s, a, s')$ and policy $\pi$, consider the entropy-regularized reward

$$r_\beta^\pi(s, a, s') = r(s, a, s') + \beta H(\pi(\cdot | s)).$$

Taking as our objective the (expected, discounted) regularized reward $r_\beta^\pi$ yields maximum-entropy reinforcement learning.

Note that the entropy penalty subtly changes the structure of the RL problem: rather than a fixed, objective reward defined independently of the agent, we now have a different reward function for each policy; the reward 'shifts under our feet' in a sense. But this turns out not to matter very much once we move up to the regularized value functions $V_\beta^\pi$ and $Q_\beta^\pi$: since value functions were always policy-dependent, allowing this dependence into the reward function creates no special difficulties.

Basics

Concretely, the $\pi$-regularized value of being in the initial state $s_0$ is

\begin{align*}
V_\beta^{\pi}(s_0) &= \mathbb{E}\left[\sum_{t=0}^{T}\gamma^t \left(r(s_t, a_t, s_{t+1}) + \beta H(\pi(\cdot | s_{t}))\right)\right]\\
&= V^{\pi}(s_0) + \beta\,\mathbb{E}\left[\sum_{t=0}^T\gamma^t H(\pi(\cdot | s_{t}))\right],
\end{align*}

equal to the unregularized value $V^\pi$, plus the expected discounted sum of policy entropies over the rest of the trajectory.
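As a sanity check, here is a minimal Monte Carlo sketch of the quantity inside the expectation (my own code; `rewards` and `entropies` are hypothetical per-step logs from a rollout):

```python
def entropy_regularized_return(rewards, entropies, gamma, beta):
    """Single-trajectory Monte Carlo estimate of V_beta^pi(s_0).

    rewards[t] is r(s_t, a_t, s_{t+1}) and entropies[t] is H(pi(.|s_t)),
    both assumed to have been logged while rolling out the policy.
    """
    ret = 0.0
    # Accumulate backwards so each step is discounted by gamma relative to the next.
    for r, h in zip(reversed(rewards), reversed(entropies)):
        ret = (r + beta * h) + gamma * ret
    return ret
```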

We define the $\pi$-regularized action value as

$$Q^{\pi}_\beta(s, a) = \mathbb{E}_{s' \mid s, a}\left[r(s, a, s') + \gamma V_\beta^\pi(s')\right];$$

so that $V$ and $Q$ are related by

$$V^\pi_\beta(s) = \mathbb{E}_{a\sim \pi}\left[Q_\beta^\pi(s, a)\right] + \beta H(\pi(\cdot | s)).$$

Note that this is a slight departure from the naïve derivation where we just substitute $r_\beta^\pi$ for $r$ everywhere. This would fold the current-step entropy term $H(\pi(\cdot | s))$ into $Q^\pi_\beta(s, a)$, whereas we've pulled it out as a separate term in $V^\pi_\beta(s)$.
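To make the two conventions explicit, the naïve substitution would define (my notation for this variant)

\begin{align*}
\tilde{Q}^\pi_\beta(s, a) &= \mathbb{E}_{s'|s,a}\left[r_\beta^\pi(s, a, s') + \gamma V_\beta^\pi(s')\right] = Q^\pi_\beta(s, a) + \beta H(\pi(\cdot | s)),\\
V^\pi_\beta(s) &= \mathbb{E}_{a\sim\pi}\left[\tilde{Q}^\pi_\beta(s, a)\right],
\end{align*}

so both conventions describe the same value $V^\pi_\beta$; they differ only in where the current-step entropy bonus is booked.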

One reason to do this is that it highlights that we can incorporate the gradient of the current-step entropy exactly in a policy gradient. Suppose we want to optimize the objective

\begin{align*}
J(\theta) &= \mathbb{E}_{s_0}\left[V^{\pi_\theta}_{\beta}(s_0)\right]\\
&= \mathbb{E}_{s_0,\, a_0\sim \pi_\theta} \left[Q^{\pi_\theta}_{\beta}(s_0, a_0) - \beta \log \pi_\theta(a_0 | s_0)\right]
\end{align*}

with respect to policy parameters $\theta$. Then we can compute the policy gradient $\nabla_\theta J(\theta)$ using the exact gradient of the entropy term, or a sampled gradient using the reparameterization trick, or just the score function. To complete the policy gradient we need an estimate of $Q^{\pi_\theta}_\beta$, either through a Monte Carlo trajectory or some version of TD error (in either case we can apply the score function trick as usual), or through a model $Q_{\phi,\beta}$ which we can directly differentiate if we have continuous actions.
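For concreteness, here is a minimal sketch of the "exact entropy gradient plus score function" variant for discrete actions (PyTorch; all names are my own, and the $Q$ estimates are treated as given constants):

```python
import torch
import torch.nn.functional as F

def maxent_pg_loss(logits, actions, q_estimates, beta):
    """Surrogate loss whose gradient is a maxent policy-gradient estimate.

    logits: [B, A] policy logits at sampled states (differentiable in theta).
    actions: [B] long tensor of sampled action indices.
    q_estimates: [B] estimates of Q_beta^pi(s, a), e.g. Monte Carlo returns
    or a critic's output, treated as constants here.
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # [B, A]
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)                     # exact H(pi(.|s)), differentiable
    logp_a = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    score_term = logp_a * q_estimates.detach()                     # score-function surrogate for E_a[Q]
    return -(score_term + beta * entropy).mean()                   # minimize the negative objective
```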

Q: what if we're using an $n$-step temporal difference estimate of $Q$ values? Then we do also know the analytic gradients of entropies at the next $n$ time steps and should presumably use them.

Treating the expressions above as Bellman backup operators, we can iteratively apply them to perform soft policy evaluation.
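For a tabular MDP this is a few lines; a minimal sketch (my own array conventions, not from any reference):

```python
import numpy as np

def soft_policy_evaluation(P, R, pi, gamma, beta, n_iters=1000):
    """Iterate the soft Bellman backups for a fixed policy in a tabular MDP.

    P[s, a, s'] are transition probabilities, R[s, a, s'] rewards,
    pi[s, a] the policy. Returns V_beta^pi and Q_beta^pi.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=-1)   # Q_beta(s,a) = E_s'[r + gamma V(s')]
        entropy = -(pi * np.log(pi + 1e-12)).sum(axis=-1)       # H(pi(.|s))
        V = (pi * Q).sum(axis=-1) + beta * entropy              # V_beta(s) = E_a[Q] + beta H
    return V, Q
```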

Soft Q-learning

Reference: Reinforcement Learning with Deep Energy-Based Policies

The maxent RL problem has an optimal policy, which we can in principle find (up to local maxima) through policy gradient methods. But what if we want to do value-based learning, without representing the policy explicitly? In particular, suppose we are given action values $Q_{\theta,\beta}(s, \cdot)$ at a particular state. If the actions are discrete, then the optimal policy at that state is given by

$$\pi_\theta^*(\cdot | s) = \text{argmax}_{\|p\|_1=1} \sum_{a=1}^{|\mathcal{A}|} p_a\left[Q_{\theta, \beta}(s, a) - \beta \log p_a\right]$$

Setting the gradient inside the argmax to zero (with a Lagrange multiplier for the normalization constraint) and solving for $p_a$, we find that the optimal policy is a softmax over scaled $Q$-values,

$$\pi^*_\theta(a | s) = \frac{1}{Z(s)}\exp\left(\frac{1}{\beta}Q_{\theta, \beta}(s, a)\right)$$

which, we are satisfied to notice, approaches a point mass at the highest-value action as $\beta\to 0$, and approaches the uniform distribution as $\beta\to\infty$. Plugging this back into the objective, with a bit of algebra we see that this policy achieves a regularized value equal to the log-sum-exp of the $Q$ values

\begin{align*}
V_\beta(s) &= \mathbb{E}_{a\sim \pi^*}\left[Q_{\theta,\beta}(s, a) - \beta \log \pi^*(a|s)\right]\\
&= \sum_{a=1}^{|\mathcal{A}|}\pi^*(a|s)\left[Q_{\theta,\beta}(s, a) - \beta\log e^{\frac{1}{\beta}Q_{\theta,\beta}(s, a)} + \beta \log Z(s)\right]\\
&= \beta \log Z(s)\\
&= \beta \log \sum_{a=1}^{|\mathcal{A}|}\exp\left(\frac{1}{\beta}Q_{\theta,\beta}(s, a)\right)
\end{align*}

which (also satisfyingly) approaches $\max_a Q(s, a)$ in the $\beta\to 0$ setting where the distribution concentrates at the highest-value action.
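A quick numerical sketch of both limits (my own code; assumes SciPy is available):

```python
import numpy as np
from scipy.special import logsumexp, softmax

def soft_policy_and_value(q_values, beta):
    """Soft-optimal policy and soft value at one state, from Q(s, .)."""
    pi = softmax(q_values / beta)            # pi*(a|s) proportional to exp(Q(s,a)/beta)
    v = beta * logsumexp(q_values / beta)    # V_beta(s) = beta log sum_a exp(Q(s,a)/beta)
    return pi, v

q = np.array([1.0, 2.0, 0.5])
for beta in [1.0, 0.1, 0.01]:
    pi, v = soft_policy_and_value(q, beta)
    print(beta, np.round(pi, 3), np.round(v, 3))   # pi -> argmax, v -> max(q) = 2.0 as beta -> 0
```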

From this value we can derive a temporal difference error

$$\delta_t = \left[r_{t+1} + \gamma \beta \log \sum_{a'=1}^{|\mathcal{A}|}\exp\left(\frac{1}{\beta}Q_{\theta,\beta}(s_{t+1}, a')\right)\right] - Q_{\theta,\beta}(s_t, a_t)$$

and related quantities. Then we can do Q-learning by minimizing this TD error, e.g., through gradient descent on $\theta$.
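Concretely, the TD error in the tabular case might look like this (a sketch with my own naming):

```python
from scipy.special import logsumexp

def soft_td_error(q_table, s, a, r, s_next, gamma, beta):
    """Soft Q-learning TD error for one transition, with a tabular q_table[s, a].

    With function approximation, the same target would instead be regressed
    onto by gradient steps on theta.
    """
    soft_v_next = beta * logsumexp(q_table[s_next] / beta)   # beta log sum_a' exp(Q(s',a')/beta)
    return (r + gamma * soft_v_next) - q_table[s, a]
```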

Is double Q-learning meaningful in this setting? In the unregularized deterministic setting, we would evaluate the optimal policy (max Q) using one network to choose the best action (argmax Q) and the other to evaluate that action. Here I guess we do the same thing? We use one network $Q_1$ to define the softmax distribution over actions, with which we take the expectation, and another network $Q_2$ to evaluate them? Let $\pi^*_1$ and $\pi^*_2$ denote the softmax policies corresponding to $Q_1$ and $Q_2$ respectively. Then we have

\begin{align*}
V_\beta(s) &= \mathbb{E}_{a\sim \pi_1^*}\left[Q_2(s, a) - \beta \log \pi_1^*(a|s)\right]\\
&= \beta\log Z_1(s) + \mathbb{E}_{a\sim \pi_1^*}\left[Q_2(s, a) - Q_1(s, a)\right]
\end{align*}

which corrects the naive estimate. We can check that this recovers the usual double Q estimate in the limit as $\beta \to 0$. The first (softmax) term becomes a hard max under $Q_1$; letting $a^*_1$ denote the optimal action (argmax) under $Q_1$, it evaluates to $Q_1(s, a^*_1)$. The expectation in the second term concentrates at $a_1^*$, so we simply evaluate the inner quantity $Q_2(s, a_1^*) - Q_1(s, a_1^*)$, and applying this correction to the first term we recover $Q_2(s, a_1^*)$ as we'd hope!
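A sketch of this estimator (my own code, not from the paper):

```python
import numpy as np
from scipy.special import logsumexp, softmax

def double_soft_value(q1, q2, beta):
    """Double-estimator soft value at one state: beta*log Z_1 + E_{pi_1}[Q_2 - Q_1].

    q1, q2: [A] arrays of action values from the two networks.
    As beta -> 0 this approaches Q_2(s, argmax_a Q_1(s, a)).
    """
    pi1 = softmax(q1 / beta)           # softmax policy under Q_1
    log_z1 = logsumexp(q1 / beta)      # log partition function of that policy
    return beta * log_z1 + np.dot(pi1, q2 - q1)
```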

Soft Actor-Critic

Given any policy $\pi$, we can apply the Bellman operators above to learn soft value functions $Q^{\pi}_\beta$ and $V^\pi_\beta$, which we parameterize using function approximators $Q_{\phi,\beta}$ and $V_{\phi,\beta}$. Following the above, the optimal policy with respect to these is simply the softmax policy on $Q$-values. Equivalently it is the policy that minimizes

\begin{align*}
J(\theta) &= \mathbb{E}_{s\sim d_\pi} \left[D_\text{KL}\left(\pi_\theta(\cdot | s) \,\middle\|\, \frac{\exp\left(\frac{1}{\beta}Q(s, \cdot)\right)}{Z(s)}\right)\right]\\
&= -\frac{1}{\beta}\,\mathbb{E}_{s\sim d_\pi,\, a\sim \pi_\theta(\cdot|s)} \left[Q(s, a) - \beta \log \pi_\theta(a|s)\right] + \text{const},
\end{align*}

so we can train a parameterized policy $\pi_\theta$ using gradients of this objective. If the actions are continuous, we can use reparameterization gradients (since $Q$ here is a differentiable function approximator); this is essentially just deep deterministic policy gradients (DDPG) with an entropy term, since the reparameterization trick effectively gives us a deterministic policy as a function of exogenous noise $\epsilon$. For small discrete action spaces, we can also do the expectation analytically (and otherwise use the score function trick or other gradient estimator of our choice).
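A minimal sketch of the reparameterized update in the continuous-action case (PyTorch; `policy_net` and `q_net` are hypothetical modules, and I'm skipping the tanh squashing and its log-det correction used in the SAC paper):

```python
import torch

def sac_policy_loss(policy_net, q_net, states, beta):
    """Reparameterized policy loss, minimizing E[beta * log pi - Q].

    policy_net(states) returns the mean and log-std of a diagonal Gaussian
    over actions; q_net(states, actions) returns a [batch] tensor of
    Q_{phi,beta}(s, a) values.
    """
    mean, log_std = policy_net(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    actions = dist.rsample()                        # reparameterized: a = mu + sigma * eps
    log_pi = dist.log_prob(actions).sum(dim=-1)     # log pi_theta(a|s), summed over action dims
    q = q_net(states, actions)                      # differentiable path through the critic
    return (beta * log_pi - q).mean()
```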

Exploration

As probabilistic inference

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (Levine, 2018)

A straightforward mapping between RL and probabilistic inference is that, for a given observed state trajectory $\mathbf{x} = s_0, \ldots, s_T$:

  • the actions $\mathbf{z} = a_0, \ldots, a_{T-1}$ are the unknown latent variables
  • the discounted advantages $\sum_t \gamma^t A(s_t, a_t)$ or, equivalently, Q-values $\sum_t \gamma^t Q(s_t, a_t)$ define the joint log-density $\log p(\mathbf{x}, \mathbf{z})$
  • the optimal policy $\pi^*(a_t|s_t)$ is the posterior $p(\mathbf{z}|\mathbf{x})$.

Concretely, if we identify a parameterized policy $\pi_\theta$ with an amortized surrogate posterior $q_\theta(z|x)$ in variational inference, then the ELBO

$$\mathbb{E}_{x\sim \mathcal{D}_X}\,\mathbb{E}_{z\sim q_\theta(\cdot | x)} \left[\log p(x, z) - \log q_\theta(z|x)\right]$$

becomes the maxent RL objective (here with $\beta = 1$; a general temperature corresponds to scaling $Q$ by $1/\beta$)

$$\mathbb{E}_{s\sim \mathcal{D}_S}\,\mathbb{E}_{a\sim \pi_\theta(\cdot | s)} \left[Q(s, a) - \log \pi_\theta(a|s)\right]$$
