Modified: October 03, 2020
mirror descent
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Mirror descent is a framework for optimization algorithms: many algorithms can be framed as mirror descent, and proofs about mirror descent generalize to all of those algorithms.
It fixes the 'type error' in traditional gradient descent: gradients are linear functionals living in a dual space, so it's bizarre to add them directly to points in the primal space. We need some kind of map from the dual space into an update we can apply in the primal space. Of course, type systems and dual spaces are human constructions, so this is more of an aesthetic complaint than a practical obstacle.
I've come around to a viewpoint I like: mirror descent is the study of preconditioning by reparameterization. That is, in general we can precondition an optimization by a positive-definite matrix $P$, or more generally, a smoothly varying field of posdef matrices $P(x)$. This means we take $x_{t+1} = x_t - \eta\, P(x_t)\, \nabla f(x_t)$ as our updates. An 'optimal' $P$ in the quadratic case would be the inverse Hessian $P(x) = (\nabla^2 f(x))^{-1}$, which gives us Newton's method.
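As a concrete sketch of the preconditioning view (a toy quadratic and hand-rolled helper names of my own, not from any library):

```python
import numpy as np

def preconditioned_step(x, grad_f, P, lr=1.0):
    """One preconditioned update: x <- x - lr * P(x) @ grad_f(x)."""
    return x - lr * P(x) @ grad_f(x)

# Toy quadratic f(x) = 1/2 x^T A x - b^T x, with gradient A x - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_f = lambda x: A @ x - b

identity_P = lambda x: np.eye(2)          # plain gradient descent
newton_P   = lambda x: np.linalg.inv(A)   # inverse Hessian: Newton's method

x = np.zeros(2)
x_gd     = preconditioned_step(x, grad_f, identity_P, lr=0.1)
x_newton = preconditioned_step(x, grad_f, newton_P)  # jumps straight to the minimizer A^{-1} b
```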
How can we choose such matrices? We might think of convex functions: a twice-differentiable function is convex if and only if its Hessian is everywhere positive semidefinite, and an everywhere positive-definite Hessian implies strict convexity. So a convex function with positive-definite Hessian defines a field of preconditioners, via its second derivatives / Hessian.
- And perhaps the converse is also true, that any smooth preconditioning function can be integrated twice to give a convex function?
- Certainly in 1D, for any smooth positive-valued function, we can integrate twice to get a convex function. For example, $1/x$ on $x > 0$ integrates twice to give $x \log x$ (up to linear terms), which is convex since its second derivative $1/x$ is positive.
- In higher dimensions, it's not true that all vector fields are the gradients of a scalar potential; the ones that are, are called 'conservative' vector fields, and satisfy properties like: the line integral between any pair of points depends only on the endpoints, not on the path. Is there an equivalent notion of when a 'matrix field' consists of the Hessians of a scalar function?
- NameRedacted says that this paper https://ecommons.cornell.edu/bitstream/handle/1813/9264/TR001387.pdf?sequence=1 addresses this
So choosing a convex function $\phi$ is choosing a set of preconditioners. Then mirror descent observes that taking a mirror step
$$x_{t+1} = \nabla\phi^{*}\big(\nabla\phi(x_t) - \eta\,\nabla f(x_t)\big)$$
is equivalent (to first order in $\eta$) to taking a step preconditioned by the inverse Hessian of $\phi$:
$$x_{t+1} \approx x_t - \eta\,\big(\nabla^2\phi(x_t)\big)^{-1}\nabla f(x_t).$$
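A quick first-order check of this equivalence (a sketch for a single unconstrained step, assuming $\phi$ is smooth with an invertible Hessian): the mirror step sets
$$\nabla\phi(x_{t+1}) = \nabla\phi(x_t) - \eta\,\nabla f(x_t),$$
while a Taylor expansion gives $\nabla\phi(x_{t+1}) \approx \nabla\phi(x_t) + \nabla^2\phi(x_t)\,(x_{t+1} - x_t)$. Equating the two and solving for the step,
$$x_{t+1} \approx x_t - \eta\,\big(\nabla^2\phi(x_t)\big)^{-1}\nabla f(x_t).$$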
So: mirror descent lets us take steps in the inverse-Hessian-preconditioned space, at the cost only of computing gradients of the mirror function and its conjugate. In that sense it gives us second-order power for first-order cost.
Unfortunately, this can still be costly in cases where the mirror function or its conjugate is something inherently expensive, like an entropy or log-normalizer of a general distribution.
Exercise: why doesn't this make Newton's method cheaper? Say we have a convex loss $f(x)$ and choose $\phi = f$ as the mirror function. Mapping to the mirror space is cheap, and doing the update in that space is cheap. But how do we translate back to the original space? We need the conjugate of $f$, or equivalently the (local) inverse of the gradient map, $(\nabla f)^{-1}$, and getting that is expensive. In general we can get a local inverse by taking a Taylor approximation of the gradient map and inverting that, but now we're back to reifying the Hessian and inverting it.
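One way to see it concretely (a quick check for the unconstrained case with $\phi = f$): the mirror update asks for $\nabla f(x_{t+1}) = (1 - \eta)\,\nabla f(x_t)$, and at $\eta = 1$ this means finding a point where $\nabla f = 0$, which is just the original optimization problem. So inverting the gradient map is at least as hard as minimizing $f$.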
From Nick Harvey's notes: A mirror descent algorithm is defined by a mirror function $\Phi : \mathcal{D} \to \mathbb{R}$, where $\mathcal{D} \subseteq \mathbb{R}^n$ is open and convex. The mirror function must be:
- strictly convex and everywhere differentiable
- the image of its gradient (the 'dual space') must equal all of $\mathbb{R}^n$
- the gradient must diverge (in norm) to $\infty$ on the boundary of the domain $\mathcal{D}$
The mirror function defines a mirror map $\nabla\Phi$ into the space of linear functionals. The inverse of this mirror map is given by the gradient of the convex conjugate: $(\nabla\Phi)^{-1} = \nabla\Phi^{*}$.
The mirror descent algorithm to optimize an objective $f$ with a mirror function $\Phi$ subject to a constraint $x \in \mathcal{X}$ is:
- Compute a subgradient $g_t \in \partial f(x_t)$ at the current point $x_t$.
- Apply the subgradient in the 'mirror' space:
  - consider the current point in mirror space, $\theta_t = \nabla\Phi(x_t)$
  - apply the gradient with learning rate $\eta$: $\theta_{t+1} = \theta_t - \eta\, g_t$
  - move back to the primal space: $y_{t+1} = \nabla\Phi^{*}(\theta_{t+1})$
- Now project back onto the constraint set using the Bregman divergence $D_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle\nabla\Phi(y),\, x - y\rangle$: take $x_{t+1} = \arg\min_{x \in \mathcal{X}} D_\Phi(x, y_{t+1})$.
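A minimal sketch of this loop in code (my own illustration, not from Harvey's notes; the mirror map, its inverse, and the Bregman projection are passed in as functions, and the Euclidean case is used for concreteness):

```python
import numpy as np

def mirror_descent(subgrad, grad_mirror, grad_conjugate, bregman_project,
                   x0, lr=0.1, steps=100):
    """Generic mirror descent: map to the dual space, step, map back, project."""
    x = x0
    for _ in range(steps):
        g = subgrad(x)                       # subgradient at the current point
        theta = grad_mirror(x) - lr * g      # additive update in the mirror space
        y = grad_conjugate(theta)            # back to the primal space
        x = bregman_project(y)               # Bregman projection onto the constraint set
    return x

# Euclidean mirror function Phi(x) = 1/2 ||x||^2: its gradient and the conjugate's gradient
# are both the identity, and the Bregman projection is the usual Euclidean projection
# (here, onto the nonnegative orthant).
subgrad = lambda x: 2 * (x - np.array([1.0, -3.0]))   # gradient of ||x - (1, -3)||^2
x_star = mirror_descent(subgrad,
                        grad_mirror=lambda x: x,
                        grad_conjugate=lambda th: th,
                        bregman_project=lambda y: np.maximum(y, 0.0),
                        x0=np.zeros(2))
# x_star ≈ (1, 0): the projection of (1, -3) onto the nonnegative orthant.
```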
If we take (half) the squared Euclidean norm as our mirror function, $\Phi(x) = \frac{1}{2}\|x\|_2^2$, then mirror descent reduces to projected subgradient descent.
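Checking this special case: $\nabla\Phi$ and $\nabla\Phi^{*}$ are both the identity, and $D_\Phi(x, y) = \frac{1}{2}\|x - y\|_2^2$, so the update becomes $x_{t+1} = \arg\min_{x \in \mathcal{X}} \frac{1}{2}\|x - (x_t - \eta\, g_t)\|_2^2$, i.e. a subgradient step followed by Euclidean projection.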
A common mirror function is negative entropy, $\Phi(x) = \sum_i x_i \log x_i$ for $x$ in the probability simplex (or, more generally, the positive orthant). In this case we have $D_\Phi(x, y) = \sum_i x_i \log\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i$ (generalized KL divergence), and the 'dual space' has the interpretation of log probabilities. Additive updates in the dual space then become multiplicative updates in the primal space. Of course, in practice one would probably implement everything in terms of gradient updates on unnormalized logits?
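A sketch of the entropic case on the simplex (my own example; with negative entropy, the mirror step followed by the Bregman projection onto the simplex works out to the familiar exponentiated-gradient / multiplicative-weights update):

```python
import numpy as np

def exponentiated_gradient_step(x, g, lr=0.1):
    """Mirror descent step with the negative-entropy mirror function on the simplex.
    Dual update log(x) - lr*g, mapped back and renormalized: a multiplicative update."""
    w = x * np.exp(-lr * g)
    return w / w.sum()

# Example: minimize a linear loss <c, x> over the probability simplex.
c = np.array([0.3, 0.1, 0.7])
x = np.ones(3) / 3
for _ in range(200):
    x = exponentiated_gradient_step(x, c)
# x concentrates on the coordinate with the smallest cost (index 1).
```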
Can I write a mirror descent update in a single line?
$$x_{t+1} = \nabla\Phi^{*}\big(\nabla\Phi(x_t) - \eta\, g_t\big)$$
or
$$x_{t+1} = \arg\min_{x \in \mathcal{X}}\; D_\Phi\big(x,\; \nabla\Phi^{*}(\nabla\Phi(x_t) - \eta\, g_t)\big)$$
or
$$x_{t+1} = \arg\min_{x \in \mathcal{X}}\; \eta\,\langle g_t, x\rangle + D_\Phi(x, x_t)$$
and by linearity (expanding the Bregman divergence and dropping terms that don't depend on $x$) this is equal to
$$x_{t+1} = \arg\min_{x \in \mathcal{X}}\; \langle \eta\, g_t - \nabla\Phi(x_t),\; x\rangle + \Phi(x).$$
- This last form implies that the mirror function acts as a regularization term. We are trying to find the point in our constraint set that maximally aligns with the 'goal' functional $\nabla\Phi(x_t) - \eta\, g_t$ while minimizing $\Phi(x)$. If $\Phi(x) = \frac{1}{2}\|x\|_2^2$, this is $\ell_2$ regularization; if $\Phi$ is negative entropy then this is maximum-entropy regularization.
- How does mirror descent connect to natural gradient? Let's think back to the variational formulation of natural gradient: we want to find the point that maximizes a linear approximation of $f$ within a KL-ball around our current point:
$$\theta_{t+1} = \arg\max_{\theta}\; \langle \nabla_\theta f(\theta_t),\; \theta - \theta_t\rangle - \lambda\, \mathrm{KL}\big(q_{\theta_t}\,\|\, q_\theta\big),$$
  where $\lambda$ is a Lagrange multiplier, which turns out to be just the inverse of the learning rate, $1/\eta$.
- From the Stanford notes, we can consider a general variational problem where we want to take steps whose size is measured by a Bregman divergence:
$$\theta_{t+1} = \arg\max_{\theta}\; \langle \nabla_\theta f(\theta_t),\; \theta - \theta_t\rangle - \frac{1}{\eta}\, D_\Phi(\theta, \theta_t)$$
- and the special case of negative entropy for the distribution $q_\theta$ (so the Bregman divergence becomes a KL divergence) recovers natural gradient.
- This is apparently also just mirror descent.
- In what sense is the Bayesian learning rule mirror descent? We are optimizing an objective $\mathcal{L}(\lambda)$, where $q_\lambda$ is an exponential-family distribution with natural parameter $\lambda$. Now we take the log-normalizer $A(\lambda)$ as our mirror function. Immediate consequences:
- The Bregman divergence is the KL divergence: $D_A(\lambda_1, \lambda_2) = \mathrm{KL}\big(q_{\lambda_2}\,\|\, q_{\lambda_1}\big)$.
- The mirror space is that of expectation parameters $\mu = \nabla A(\lambda) = \mathbb{E}_{q_\lambda}[T(\theta)]$, where $T$ is the sufficient statistic.
- The mirror descent update would ask us to compute the update $\mu_{t+1} = \mu_t - \eta\, \nabla_\lambda \mathcal{L}(\lambda_t)$, i.e. update the expectation parameters with gradients taken with respect to the natural parameters. I'm not sure how helpful this is.
- Let's consider the opposite interpretation. Let the conjugate $A^{*}(\mu)$ (the negative entropy, viewed as a function of the expectation parameters $\mu$) be the mirror function. Then the update is $\lambda_{t+1} = \lambda_t - \eta\, \nabla_\mu \mathcal{L}(\mu_t)$, which equals $\lambda_t - \eta\, F(\lambda_t)^{-1}\, \nabla_\lambda \mathcal{L}(\lambda_t)$, where $F(\lambda)$ is the Fisher information matrix. This works because exponential families are log-linear, so the Hessian of the log-likelihood is constant in $\theta$ and equal to (minus) the Fisher, which would ordinarily only be the expected Hessian:
$$F(\lambda) = \mathbb{E}_{q_\lambda}\big[-\nabla^2_\lambda \log q_\lambda(\theta)\big] = \nabla^2_\lambda A(\lambda).$$
  (A short chain-rule check of the $\nabla_\mu = F^{-1}\nabla_\lambda$ step is sketched at the end of these notes.)
So we conclude that natural gradient on $\lambda$ is mirror descent on $\mu$ (with $A^{*}$ as the mirror function).
- Is the converse of this still helpful? Mirror descent in $\lambda$ (with $A$ as the mirror function) is natural gradient in $\mu$. It might sometimes be nice in practice to do descent on $\mu$ because the expectation parameterization is numerically better-behaved. But we'd have to take gradients with respect to $\lambda$ anyway, which I think defeats the point?
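A short check of the chain-rule step used above (a sketch, assuming a minimal exponential family $q_\lambda(\theta) = h(\theta)\exp\big(\langle \lambda, T(\theta)\rangle - A(\lambda)\big)$):
$$\mu = \nabla_\lambda A(\lambda), \qquad \frac{\partial \mu}{\partial \lambda} = \nabla^2_\lambda A(\lambda) = F(\lambda),$$
so by the chain rule $\nabla_\lambda \mathcal{L} = F(\lambda)\, \nabla_\mu \mathcal{L}$, i.e. $\nabla_\mu \mathcal{L} = F(\lambda)^{-1}\, \nabla_\lambda \mathcal{L}$. Updating $\lambda$ with the gradient taken in $\mu$-coordinates is therefore exactly a natural-gradient step on $\lambda$.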