Modified: July 28, 2022
maximum-entropy reinforcement learning
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

For any reward function $r(s, a)$ and policy $\pi$, consider the entropy-regularized reward
$$r^\pi_\alpha(s, a) = r(s, a) + \alpha \mathcal{H}\big(\pi(\cdot \mid s)\big),$$
where $\mathcal{H}$ denotes entropy and $\alpha > 0$ is a temperature.
Taking as our objective the (expected, discounted) regularized reward $\mathbb{E}_\pi\left[\sum_t \gamma^t r^\pi_\alpha(s_t, a_t)\right]$ yields maximum-entropy reinforcement learning.
Note that the entropy penalty subtly changes the structure of the RL problem: rather than a fixed, objective reward defined independently of the agent, we now have a different reward function for each policy; the reward 'shifts under our feet' in a sense. But this turns out not to matter very much once we move up to consider the regularized value functions $V^\pi_\alpha$ and $Q^\pi_\alpha$: since value functions were always policy-dependent, allowing this dependence into the reward function creates no special difficulties.
Basics
Concretely, the $\alpha$-regularized value of being in the initial state $s_0$ is
$$V^\pi_\alpha(s_0) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \Big( r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right],$$
equal to the unregularized value $V^\pi(s_0)$, plus the expected discounted sum of policy entropies over the rest of the trajectory.
We define the $\alpha$-regularized action value as
$$Q^\pi_\alpha(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'}\left[V^\pi_\alpha(s')\right],$$
so that $V^\pi_\alpha$ and $Q^\pi_\alpha$ are related by
$$V^\pi_\alpha(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^\pi_\alpha(s, a)\right] + \alpha \mathcal{H}\big(\pi(\cdot \mid s)\big).$$
Note that this is a slight departure from the naïve derivation where we just substitute $r^\pi_\alpha$ for $r$ everywhere. That would fold the current-step entropy term into $Q^\pi_\alpha$, whereas here we've pulled it out as a separate term in $V^\pi_\alpha$.
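For concreteness, here's a minimal numpy sketch of the $V$–$Q$ relation at a single state with discrete actions (the Q-values, policy, and temperature below are made-up placeholders):

```python
import numpy as np

# Hypothetical quantities at a single state s (discrete actions).
q = np.array([1.0, 2.0, 0.5])       # Q^pi_alpha(s, a) for each action a
pi = np.array([0.2, 0.7, 0.1])      # pi(a | s)
alpha = 0.5                         # entropy temperature

entropy = -np.sum(pi * np.log(pi))  # H(pi(. | s))

# V^pi_alpha(s) = E_{a~pi}[Q^pi_alpha(s, a)] + alpha * H(pi(. | s)):
# the current-step entropy is a separate term, not folded into Q.
v = np.sum(pi * q) + alpha * entropy
print(v)
```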
One reason to do this is that it highlights that we can incorporate the gradient of the current-step entropy exactly in a policy gradient. Suppose we want to optimize the objective
$$J(\theta) = \mathbb{E}_{s}\left[\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[Q^{\pi_\theta}_\alpha(s, a)\right] + \alpha \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)\right]$$
with respect to policy parameters $\theta$. Then we can compute the policy gradient using the exact gradient of the entropy term, or a sampled gradient using the reparameterization trick, or just the score function. To complete the policy gradient we need an estimate of $Q^{\pi_\theta}_\alpha(s, a)$, either through a Monte Carlo trajectory or some version of TD error (in either case we can apply the score function trick as usual), or through a model which we can directly differentiate if we have continuous actions.
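As a sketch (not any particular paper's implementation): for a small discrete action space we can take the expectation over actions analytically and let autodiff supply the exact entropy gradient; here `q_estimate` is a hypothetical stand-in for whatever estimate of $Q^{\pi_\theta}_\alpha$ we have (Monte Carlo return, TD critic, etc.):

```python
import torch

def policy_objective(logits, q_estimate, alpha):
    """Entropy-regularized objective E_{a~pi}[Q(s, a)] + alpha * H(pi(. | s)).

    logits:      (batch, n_actions) outputs of the policy network pi_theta
    q_estimate:  (batch, n_actions) estimates of Q^pi_alpha; detached so that
                 gradients flow only through the policy
    """
    log_pi = torch.log_softmax(logits, dim=-1)
    pi = log_pi.exp()
    expected_q = (pi * q_estimate.detach()).sum(-1)  # analytic expectation over actions
    entropy = -(pi * log_pi).sum(-1)                 # exact, differentiable entropy term
    return (expected_q + alpha * entropy).mean()     # maximize this (or minimize its negative)
```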
Q: what if we're using an $n$-step temporal difference estimate of values? Then we also know the analytic gradients of the entropies at the next $n$ time steps, and should presumably use them.
Treating the expressions above as Bellman backup operators, we can iteratively apply them to perform soft policy evaluation.
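As a sketch, for a tabular MDP soft policy evaluation just alternates the two backups above until convergence (the arguments here are hypothetical placeholders):

```python
import numpy as np

def soft_policy_evaluation(P, R, pi, alpha, gamma, n_iters=1000):
    """Iterate the soft Bellman backups for a tabular MDP.

    P:  (S, A, S) transition probabilities
    R:  (S, A) rewards
    pi: (S, A) policy probabilities
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)                              # Q(s,a) = r(s,a) + gamma E[V(s')]
        entropy = -np.sum(pi * np.log(pi + 1e-12), axis=1)   # H(pi(. | s))
        V = np.sum(pi * Q, axis=1) + alpha * entropy         # V(s) = E_pi[Q] + alpha H
    return V, Q
```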
Soft Q-learning
Reference: Reinforcement Learning with Deep Energy-Based Policies
The maxent RL problem has an optimal policy, which we can in principle find (up to local maxima) through policy gradient methods. But what if we want to do value-based learning, without representing the policy explicitly? In particular, suppose we are given action values $Q_\alpha(s, \cdot)$ at a particular state $s$. If the actions are discrete, then the optimal policy at that state is given by
$$\pi^*(\cdot \mid s) = \operatorname*{argmax}_\pi \left\{ \mathbb{E}_{a \sim \pi}\left[Q_\alpha(s, a)\right] + \alpha \mathcal{H}(\pi) \right\}.$$
Setting the gradient inside the argmax to zero and solving for $\pi$, we find that the optimal policy is a softmax over scaled $Q$-values,
$$\pi^*(a \mid s) = \frac{\exp\big(Q_\alpha(s, a)/\alpha\big)}{\sum_{a'} \exp\big(Q_\alpha(s, a')/\alpha\big)},$$
which, we are satisfied to notice, approaches a point mass at the highest-value action as $\alpha \to 0$, and approaches the uniform distribution as $\alpha \to \infty$. Plugging this back into the objective, with a bit of algebra we see that this policy achieves a regularized value equal to the (scaled) log-sum-exp of the $Q$-values,
$$V_\alpha(s) = \alpha \log \sum_a \exp\big(Q_\alpha(s, a)/\alpha\big),$$
which (also satisfyingly) approaches $\max_a Q_\alpha(s, a)$ as $\alpha \to 0$, where the distribution concentrates at the highest-value action.
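A quick numpy check of these expressions (with made-up $Q$-values), including the limiting behavior:

```python
import numpy as np
from scipy.special import logsumexp, softmax

q = np.array([1.0, 2.0, 0.5])            # hypothetical Q_alpha(s, .) at one state

for alpha in [10.0, 1.0, 0.1]:
    pi = softmax(q / alpha)              # optimal policy: softmax over scaled Q-values
    v = alpha * logsumexp(q / alpha)     # regularized value: scaled log-sum-exp
    # Sanity check: plugging pi back into E_pi[Q] + alpha * H(pi) gives the same value.
    v_check = np.sum(pi * q) - alpha * np.sum(pi * np.log(pi))
    print(alpha, pi.round(3), v, np.isclose(v, v_check))

# As alpha -> 0, pi concentrates on argmax(q) and v approaches max(q) = 2.0;
# as alpha -> infinity, pi approaches the uniform distribution.
```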
From this value we can derive a temporal difference error
$$\delta = r(s, a) + \gamma\, \alpha \log \sum_{a'} \exp\big(Q_\phi(s', a')/\alpha\big) - Q_\phi(s, a)$$
and related quantities. Then we can do Q-learning by minimizing this TD error, e.g., through gradient descent on the squared error $\delta^2$ with respect to the parameters $\phi$.
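A hedged PyTorch sketch of that loss for discrete actions; `q_net` and `q_target_net` are assumed to map states to a vector of $Q$-values, one per action:

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_net, q_target_net, batch, alpha, gamma):
    """Soft Q-learning loss: regress Q(s, a) onto r + gamma * alpha * logsumexp(Q'(s', .) / alpha)."""
    s, a, r, s_next, done = batch  # a: long action indices; done: float 0/1 mask
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_v = alpha * torch.logsumexp(q_target_net(s_next) / alpha, dim=-1)
        target = r + gamma * (1.0 - done) * next_v
    return F.mse_loss(q_sa, target)
```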
Is double Q-learning meaningful in this setting? In the unregularized deterministic setting, we would evaluate the optimal policy (max Q) using one network to choose the best action (argmax Q) and the other to evaluate that action. Here I guess we do the same thing? We use one network to define the softmax distribution over actions, with which we take the expectation, and another network to evaluate them? Let $\pi_1$ and $\pi_2$ denote the softmax policies corresponding to $Q_1$ and $Q_2$ respectively. Then we have
$$V(s') \approx \alpha \log \sum_{a'} \exp\big(Q_1(s', a')/\alpha\big) + \mathbb{E}_{a' \sim \pi_1}\left[Q_2(s', a') - Q_1(s', a')\right],$$
which corrects the naive estimate. We can check that this recovers the usual double-Q estimate in the limit as $\alpha \to 0$. The softmax first term becomes a hard max over $Q_1$; letting $a^*$ denote the optimal action (argmax) under $Q_1$, this evaluates to $Q_1(s', a^*)$. The expectation in the second term concentrates at $a^*$, so we simply evaluate the inner quantity $Q_2(s', a^*) - Q_1(s', a^*)$, and applying this correction to the first term we recover $Q_2(s', a^*)$, as we'd hope!
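A quick numpy check (with arbitrary $Q$-values) that this estimate approaches $Q_2(s', \operatorname{argmax}_{a'} Q_1(s', a'))$ as $\alpha \to 0$:

```python
import numpy as np
from scipy.special import logsumexp, softmax

q1 = np.array([1.0, 2.0, 0.5])   # hypothetical Q-values from the two networks
q2 = np.array([0.8, 1.7, 0.9])

for alpha in [1.0, 0.1, 0.01]:
    pi1 = softmax(q1 / alpha)
    v_double = alpha * logsumexp(q1 / alpha) + np.sum(pi1 * (q2 - q1))
    print(alpha, v_double)

# As alpha -> 0 this approaches q2[np.argmax(q1)] = 1.7, the usual double-Q target.
```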
Soft Actor-Critic
Given any policy $\pi$, we can apply the Bellman operators above to learn soft value functions $V^\pi_\alpha$ and $Q^\pi_\alpha$, which we parameterize using function approximators $V_\psi$ and $Q_\phi$. Following the above, the optimal policy with respect to these is simply the softmax policy on $Q_\phi$-values. Equivalently, it is the policy that minimizes
$$J_\pi(\theta) = \mathbb{E}_{s}\left[\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\alpha \log \pi_\theta(a \mid s) - Q_\phi(s, a)\right]\right],$$
so we can train a parameterized policy $\pi_\theta$ using gradients of this objective. If the actions are continuous, we can use reparameterization gradients (since $Q_\phi$ here is a differentiable function approximator) - this is essentially just deep deterministic policy gradients (DDPG) with an entropy term, since the reparameterization trick effectively gives us a deterministic policy $a = f_\theta(s, \epsilon)$ as a function of exogenous noise $\epsilon$. For small discrete action spaces, we can also do the expectation analytically (and otherwise use the score function trick or other gradient estimator of our choice).
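A minimal PyTorch sketch of the continuous-action case with a reparameterized Gaussian policy (a real SAC implementation would also tanh-squash the actions and correct the log-probabilities; `policy_net` and `q_net` are hypothetical modules):

```python
import torch
from torch.distributions import Normal

def actor_loss(policy_net, q_net, states, alpha):
    """Reparameterized policy loss: minimize E_{a~pi}[alpha * log pi(a|s) - Q(s, a)]."""
    mean, log_std = policy_net(states)       # Gaussian policy parameters
    dist = Normal(mean, log_std.exp())
    actions = dist.rsample()                 # reparameterized sample: a = mu + sigma * eps
    log_pi = dist.log_prob(actions).sum(-1)  # sum over action dimensions
    q = q_net(states, actions).squeeze(-1)   # differentiable with respect to the actions
    return (alpha * log_pi - q).mean()
```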
Exploration
As probabilistic inference
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (Levine, 2018)
A straightforward mapping between RL and probabilistic inference is that, for a given observed state trajectory $s_{1:T}$,
- the actions $a_{1:T}$ are the unknown latent variables,
- the discounted advantages (or, equivalently, the Q-values) define the joint log-density, and
- the optimal policy is the corresponding posterior over actions.
Concretely, if we identify a parameterized policy $\pi_\theta$ with an amortized surrogate posterior in variational inference, then the ELBO
$$\mathbb{E}_{a_{1:T} \sim \pi_\theta}\left[\log p(a_{1:T} \mid s_{1:T}) - \log \pi_\theta(a_{1:T} \mid s_{1:T})\right]$$
becomes (taking $\alpha = 1$) the maxent RL objective
$$\mathbb{E}_{\pi_\theta}\left[\sum_t \gamma^t \Big( r(s_t, a_t) + \mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big) \Big)\right].$$
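The key step is a one-liner: under this identification the ELBO's $-\mathbb{E}_q[\log q]$ term is exactly the entropy bonus,
$$-\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\left[\log \pi_\theta(a \mid s_t)\right] = \mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big),$$
so maximizing the ELBO maximizes expected reward plus policy entropy at each step, which is exactly the maxent RL trade-off.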
References
- Spinning up in deep RL: https://spinningup.openai.com/en/latest/algorithms/sac.html
- Equivalence Between Policy Gradients and Soft Q-Learning: https://arxiv.org/abs/1704.06440