mirror descent: Nonlinear Function
Created: September 07, 2020
Modified: October 03, 2020


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Mirror descent is a framework for optimization algorithms: many algorithms can be framed as mirror descent, and proofs about mirror descent generalize to all of those algorithms.

It fixes the 'type error' in traditional gradient descent: gradients are linear functionals living in a dual space, so it's bizarre to add them directly to points in the primal space. We need some kind of map from the dual space into an update we can apply in the primal space. Of course, type systems and dual spaces are human constructions; plain gradient descent still runs without respecting them, but respecting the geometry they encode tends to give better-behaved updates.

I've come towards a viewpoint I like: mirror descent is the study of preconditioning by reparameterization. That is, in general we can precondition an optimization by a positive-definite matrix, or more generally, a smoothly varying set of posdef matrices $A(x)$. This means we take $g_A(x) = A(x)\nabla_x f(x)$ as our updates. An 'optimal' $A$ in the quadratic case would be the inverse Hessian $A(x) = \nabla^{-2}_x f(x)$, which gives us Newton's method.
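For concreteness, here is a minimal numpy sketch of this (toy numbers, my own choice, written with the usual minus sign for minimization): with $A(x) = I$ we get a plain gradient step, and with $A(x)$ equal to the inverse Hessian the step is exactly a Newton step, which lands on the minimizer of a quadratic.

```python
import numpy as np

# Preconditioned gradient steps g_A(x) = A(x) @ grad_f(x) on a toy quadratic
# f(x) = 0.5 x^T Q x - b^T x, whose gradient is Q x - b and Hessian is Q.
Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return Q @ x - b

x0 = np.zeros(2)

x_gd = x0 - 0.1 * grad_f(x0)                     # plain gradient step: A(x) = I
x_newton = x0 - np.linalg.solve(Q, grad_f(x0))   # A(x) = Q^{-1}: a Newton step

print(x_gd)
print(x_newton, np.linalg.solve(Q, b))  # the Newton step equals the minimizer Q^{-1} b
```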

How can we choose such matrices? We might think of convex functions: a twice-differentiable function is convex if and only if its Hessian is everywhere positive semidefinite, and an everywhere positive-definite Hessian gives strict convexity. So a convex function defines a space of preconditioners, by its second derivatives / Hessian.

  • And perhaps the converse is also true, that any smooth preconditioning function $A(x)$ can be integrated twice to give a convex function?
  • Certainly in 1D, for any smooth positive-valued function, we can integrate twice to get a convex function. For example, the constant function $1$ integrates twice to give $x^2/2$, which is convex.
  • In higher dimensions, it's not true that all vector fields are the gradients of a scalar potential; the ones that are, are called 'conservative' vector fields, and satisfy properties like: the path integral between any pair of points depends only on the endpoints, not the path. Is there an equivalent notion of when a 'matrix field' is generated by the Hessians of a scalar function?

So choosing a convex function $\Phi$ is choosing a set of preconditioners. Then mirror descent observes that taking a mirror step

$$\mu' = \mu + \alpha \nabla_x f(x)$$

is equivalent, to first order in $\alpha$, to taking a step preconditioned by the inverse Hessian of $\Phi$:

$$\begin{aligned}x(\mu') &= x\left(\mu + \alpha \nabla_x f(x)\right)\\ &\approx_\text{Taylor} x(\mu) + J_{(\mu \to x)} \left(\mu' - \mu\right)\\ &= x + \alpha J_{(\mu \to \nabla_\mu \Phi^*(\mu))} \nabla_x f(x)\\ &= x + \alpha \nabla^2_\mu \Phi^*(\mu) \nabla_x f(x)\\ &= x + \alpha \nabla^{-2}_x \Phi(x) \nabla_x f(x) \end{aligned}$$

So: mirror descent lets us take steps in the inverse-Hessian-preconditioned space, at the cost only of computing gradients of the mirror function and its conjugate. In that sense it gives us second-order power for first-order cost.

Unfortunately, this can still be costly in cases where the mirror function or its conjugate is something inherently expensive, like an entropy or log-normalizer of a general distribution.
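As a sanity check of the derivation above, here is a small numpy sketch (my own toy setup, not from any particular source) using the negative-entropy mirror function on the positive orthant, where $\nabla\Phi(x) = \log x + 1$, $\nabla\Phi^*(\mu) = \exp(\mu - 1)$, and $\nabla^{-2}\Phi(x) = \operatorname{diag}(x)$. For a small step, the dual-space update matches the inverse-Hessian-preconditioned primal update.

```python
import numpy as np

# Phi(x) = sum_i x_i log x_i on the positive orthant, so that
#   grad Phi(x)   = log(x) + 1       (the mirror map)
#   grad Phi*(m)  = exp(m - 1)       (its inverse)
#   grad^2 Phi(x) = diag(1/x)  =>  inverse Hessian = diag(x)
x = np.array([0.2, 0.5, 0.3])
g = np.array([1.0, -0.5, 0.25])   # stands in for grad_x f(x)
alpha = 1e-3                       # small, so the Taylor approximation is tight

mu = np.log(x) + 1.0
x_mirror = np.exp(mu + alpha * g - 1.0)   # step taken in the mirror (dual) space
x_precond = x + alpha * x * g             # step preconditioned by diag(x)

print(np.max(np.abs(x_mirror - x_precond)))   # O(alpha^2): tiny
```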

  • Exercise: why doesn't this make Newton's method cheaper? Say we have a convex loss $f(x)$ and choose $\Phi(x) = f(x)$. Mapping to the mirror space $\mu = \nabla_x f(x)$ is cheap, and doing the update in that space is cheap. But how do we translate back to the original space? We need the conjugate of $f$, or equivalently the (local) inverse of the gradient map, $(\nabla_x f(x))^{-1}$, and getting that is expensive. In general we can get a local inverse by taking a Taylor approximation of the gradient map, and inverting that---but now we're back to reifying the Hessian and inverting it.

  • From Nick Harvey's notes: A mirror descent algorithm is defined by a mirror function $\Phi: \mathcal{D} \to \mathbb{R}$ where $\mathcal{D} \subseteq \mathbb{R}^n$. The mirror function must be:

    • strictly convex and everywhere differentiable
    • the 'dual space' $\{\nabla \Phi(x) : x \in \mathcal{D}\}$ of its gradients must equal all of $\mathbb{R}^n$
    • the gradient $\nabla \Phi(x)$ must diverge in norm as $x$ approaches the boundary of the domain $\mathcal{D}$
  • The mirror function defines a mirror map $\hat{x} = \nabla \Phi(x)$ into the space of linear functionals. The inverse of this mirror map is given by the gradient of the convex conjugate: $x = \nabla \Phi^*(\hat{x})$.

  • The mirror descent algorithm to optimize an objective $f(x)$ with a mirror function $\Phi$, subject to a constraint $x \in \mathcal{X} \subseteq \mathcal{D}$, is (sketched in code after the steps below):

    1. Compute a subgradient $g_t = \nabla f(x_t)$ at the current point $x_t \in \mathcal{X}$.
    2. Apply the subgradient in the 'mirror' space:
      • consider the current point in mirror space, $\hat{x}_t = \nabla \Phi(x_t)$
      • apply the gradient with a learning rate: $\hat{y}_{t+1} = \hat{x}_t + \alpha_t g_t$
      • move back to the primal space: $y_{t+1} = \nabla \Phi^*(\hat{y}_{t+1})$
    3. Now project back onto the constraint set $\mathcal{X}$ using the Bregman divergence $D_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle x - y, \nabla \Phi(y) \rangle$:
      • $x_{t+1} = \text{argmin}_{x \in \mathcal{X}} D_\Phi(x, y_{t+1})$
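Here is a minimal numpy sketch of one such step, purely illustrative (the function names are mine). Instantiated with the Euclidean mirror function $\Phi(x) = \tfrac12\|x\|^2$ (the conventional $\tfrac12$ just rescales the step), it reduces to projected gradient descent, as the next bullet notes.

```python
import numpy as np

def mirror_descent_step(x, grad_f, grad_phi, grad_phi_star, bregman_project, alpha):
    """One mirror descent step (written here for minimization, so the dual-space
    step subtracts the gradient; the notes above use the ascent sign)."""
    g = grad_f(x)                     # 1. (sub)gradient at the current point
    x_hat = grad_phi(x)               # 2a. current point in the mirror space
    y_hat = x_hat - alpha * g         # 2b. apply the gradient there
    y = grad_phi_star(y_hat)          # 2c. map back to the primal space
    return bregman_project(y)         # 3. Bregman projection onto the constraint set

# Euclidean instance: Phi(x) = 0.5 * ||x||^2, so grad Phi and grad Phi* are both
# the identity and the Bregman projection is the ordinary Euclidean projection.
# This is projected gradient descent, here onto the nonnegative orthant.
c = np.array([1.0, -3.0])
grad_f = lambda x: 2.0 * (x - c)      # f(x) = ||x - c||^2

x = np.zeros(2)
for _ in range(100):
    x = mirror_descent_step(
        x, grad_f,
        grad_phi=lambda z: z,
        grad_phi_star=lambda z: z,
        bregman_project=lambda z: np.maximum(z, 0.0),
        alpha=0.1,
    )
print(x)   # approximately [1, 0], the projection of c onto the orthant
```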
  • If we take the squared Euclidean norm as our mirror function, $\Phi(x) = \|x\|^2_2$, then mirror descent reduces to projected subgradient descent.

  • A common mirror function is negative entropy, $\Phi(x) = \sum_i x_i \log x_i$ for $x \in \mathcal{X} = \Delta^{n}$. In this case we have $D_\Phi(x, y) = KL[x \| y] - \|x\|_1 + \|y\|_1$ (the generalized KL divergence, which is just $KL[x \| y]$ on the simplex), and the 'dual space' has the interpretation of log probabilities. Additive updates in the dual space then become multiplicative updates in the primal space (see the sketch below). Of course, in practice one would probably implement everything in terms of gradient updates on unnormalized logits?
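A small sketch of this case (my own toy objective, exponentiated gradient / multiplicative weights): the mirror-space update is additive in $\log x$, which in the primal space is a multiplicative update followed by renormalization, the Bregman (KL) projection onto the simplex.

```python
import numpy as np

def entropic_step(x, g, alpha):
    y = x * np.exp(-alpha * g)     # additive in log-space == multiplicative here
    return y / y.sum()             # Bregman (KL) projection onto the simplex

# Toy linear objective f(x) = <c, x> minimized over the probability simplex.
c = np.array([0.3, 0.1, 0.7])
x = np.ones(3) / 3
for _ in range(200):
    x = entropic_step(x, c, alpha=0.5)
print(x)   # mass concentrates on the smallest cost, index 1
```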

  • Can I write a mirror descent update in a single line?

    $$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} D_\Phi \left(x, \nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t))\right)$$

    or

    $$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} \left( \Phi(x) - \Phi(\nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t))) - \left \langle x - \nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t) ), \nabla \Phi(\nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t)))\right \rangle \right)$$

or

$$\begin{aligned}\hat{y}_{t+1} &= \nabla \Phi(x_t) + \alpha \nabla f(x_t)\\ y_{t+1} &= \nabla \Phi^*(\hat{y}_{t+1})\\ x_{t+1} &= \text{argmin}_{x\in \mathcal{X}} \left( \Phi(x) - \left \langle x, \hat{y}_{t+1}\right \rangle \right)\end{aligned}$$

By linearity, $\langle x, \hat{y}_{t+1}\rangle = \langle x, \nabla\Phi(x_t)\rangle + \alpha \langle x, \nabla f(x_t)\rangle$. In the 'lazy' (dual-averaging) variant, where the dual iterate just accumulates gradients starting from $\nabla\Phi(x_1) = 0$, this takes the form

$$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} \left( \Phi(x) - \alpha \Big\langle x, \sum_{s \le t} \nabla f(x_{s})\Big\rangle \right)$$
  • This last form implies that the mirror function $\Phi(x)$ functions as a regularization term. We are trying to find the point $x$ in our constraint space that maximally aligns with the 'goal' functional $\hat{y}_{t+1}$ while minimizing $\Phi(x)$. If $\Phi(x) = \|x\|^2_2$, this is $L_2$ regularization; if $\Phi(x) = \sum_i x_i \log x_i$ then this is maximum-entropy regularization.
  • How does mirror descent connect to natural gradient? Let's think back to the variational formulation of natural gradient: we want to find the point that maximizes a linear approximation of $f$ within a KL-ball around our current point:
    $$x_{t+1} - x_t = \text{argmax}_d\; \langle d, \nabla f(x_t)\rangle - \lambda\, d^T F d$$
    where $\lambda$ is a Lagrange multiplier for the KL constraint; setting the gradient in $d$ to zero gives $\nabla f(x_t) = 2\lambda F d$, so the solution is $d^* = \tfrac{1}{2\lambda} F^{-1} \nabla f(x_t) \propto F^{-1} \nabla f(x_t)$.
  • From the Stanford notes, we can consider a general variational problem where we want to take steps measured by the Bregman divergence:
    $$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} D_\Phi(x, x_t) + \langle x - x_t, \nabla f(x_t)\rangle.$$
    • and the special case of negative entropy for $\Phi$ recovers natural gradient.
    • This is apparently also just mirror descent; the first-order condition below makes that explicit.
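    My own quick check, not from the notes: taking the gradient of that objective and setting it to zero recovers exactly the mirror step in the dual space (with the descent sign of this formulation, and ignoring the constraint set):

    $$\nabla_x \left[ D_\Phi(x, x_t) + \langle x - x_t, \nabla f(x_t)\rangle \right] = \nabla \Phi(x) - \nabla \Phi(x_t) + \nabla f(x_t) = 0 \quad\Longrightarrow\quad \nabla \Phi(x_{t+1}) = \nabla \Phi(x_t) - \nabla f(x_t).$$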
  • In what sense is the Bayesian learning rule mirror descent? We are optimizing an objective $L(\lambda) = E_{q_\lambda}[\ell(x)]$ where $q$ is an exponential-family distribution with natural parameter $\lambda$. Now we take the log-normalizer $A(\lambda) = \log \int \exp\langle \lambda, t(x)\rangle\, dx$ as our mirror function. Immediate consequences:
    • The Bregman divergence $D_A(\lambda, \eta) = A(\lambda) - A(\eta) - \langle \lambda - \eta, \nabla A(\eta) \rangle$ is the KL divergence $KL[q_\eta \| q_\lambda]$.
    • The mirror space is that of expectation parameters $\mu$.
    • The mirror descent update would ask us to compute the update $\mu' = \mu + \nabla_\lambda L(\lambda)$. I'm not sure how helpful this is.
  • Let's consider the opposite interpretation. Let $A^*$ be the mirror function. Then the update is $\lambda' = \lambda + \nabla_\mu L(\lambda(\mu)) = \lambda + J_{\lambda\to\mu}^{-1} \nabla_\lambda L(\lambda)$, where
$$\begin{aligned}J_{\lambda\to\mu} &= \nabla_\lambda\, \mu(\lambda)\\ &= \nabla^2_\lambda A(\lambda) \end{aligned}$$

is the Fisher information matrix: because exponential families are log-linear in $\lambda$, the Hessian of the log-likelihood does not depend on $x$, so the Fisher (which would ordinarily be the expected negative Hessian) equals the negative Hessian itself:

$$\begin{aligned}F &= -E_q[\nabla_\lambda^2 \log q_\lambda(x)]\\ &= -E_q\left[\nabla_\lambda^2 \left(\langle \lambda, t(x)\rangle - A(\lambda)\right)\right ]\\ &= -E_q\left[-\nabla_\lambda^2 A(\lambda)\right]\\ &= \nabla_\lambda^2 A(\lambda)\end{aligned}$$

So we conclude that natural gradient on $\lambda$ is mirror descent on $\mu$.
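A tiny numerical check of this conclusion on a toy Bernoulli family (my own example, not from the source): the natural-gradient direction $F^{-1}\nabla_\lambda L$ coincides with the plain gradient $\nabla_\mu L$ in expectation parameters.

```python
import numpy as np

# Bernoulli in natural parameterization: lambda is the logit,
# A(lambda) = log(1 + exp(lambda)), expectation parameter mu = sigmoid(lambda),
# Fisher F = A''(lambda) = mu (1 - mu).
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
ell = np.array([2.0, -1.0])          # loss values ell(x) for x in {0, 1}

lam = 0.7
mu = sigmoid(lam)
F = mu * (1.0 - mu)

# L(lambda) = E_q[ell(x)] = (1 - mu) ell(0) + mu ell(1)
dL_dlam = (ell[1] - ell[0]) * mu * (1.0 - mu)   # chain rule through mu(lambda)
dL_dmu = ell[1] - ell[0]                         # gradient in expectation parameters

print(dL_dmu, dL_dlam / F)   # identical: natural gradient in lambda == gradient in mu
```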

  • Is the converse of this still helpful? Mirror descent in $\lambda$ is natural gradient in $\mu$. It might sometimes be nice in practice to do descent on $\mu$ because it's numerically better-behaved. But we'd have to take gradients with respect to $\lambda$ anyway, which I think defeats the point?