mirror descent: Nonlinear Function
Created: September 07, 2020
Modified: October 03, 2020


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Mirror descent is a framework for optimization algorithms: many algorithms can be framed as mirror descent, and proofs about mirror descent generalize to all of those algorithms.

It fixes the 'type error' in traditional gradient descent: gradients are linear functionals living in a dual space, so it's bizarre to add them directly to points in the primal space. We need some kind of map from the dual space into an update we can apply in the primal space. Of course, type systems and dual spaces are human constructions; plain gradient descent still runs without respecting them, but respecting the geometry they encode tends to give better-behaved updates.

I've come towards a viewpoint I like: mirror descent is the study of preconditioning by reparameterization. That is, in general we can precondition an optimization by a positive-definite matrix, or more generally, a smoothly varying set of posdef matrices $A(x)$. This means we take $g_A(x) = A(x)\nabla_x f(x)$ as our updates. An 'optimal' $A$ in the quadratic case would be the inverse Hessian $A(x) = \nabla^{-2}_x f(x)$, which gives us Newton's method.
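For concreteness, here is a minimal numpy sketch of this (toy numbers, my own choice, written with the usual minus sign for minimization): with $A(x) = I$ we get a plain gradient step, and with $A(x)$ equal to the inverse Hessian the step is exactly a Newton step, which lands on the minimizer of a quadratic.

```python
import numpy as np

# Preconditioned gradient steps g_A(x) = A(x) @ grad_f(x) on a toy quadratic
# f(x) = 0.5 x^T Q x - b^T x, whose gradient is Q x - b and Hessian is Q.
Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return Q @ x - b

x0 = np.zeros(2)

x_gd = x0 - 0.1 * grad_f(x0)                     # plain gradient step: A(x) = I
x_newton = x0 - np.linalg.solve(Q, grad_f(x0))   # A(x) = Q^{-1}: a Newton step

print(x_gd)
print(x_newton, np.linalg.solve(Q, b))  # the Newton step equals the minimizer Q^{-1} b
```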

How can we choose such matrices? We might think of convex functions: a twice-differentiable function is convex if and only if its Hessian is everywhere positive semidefinite, and an everywhere positive-definite Hessian gives strict convexity. So a convex function defines a space of preconditioners, by its second derivatives / Hessian.

  • And perhaps the converse is also true, that any smooth preconditioning function $A(x)$ can be integrated twice to give a convex function?
  • Certainly in 1D, for any smooth positive-valued function, we can integrate twice to get a convex function. For example, the constant function $1$ integrates twice to give $x^2/2$, which is convex.
  • In higher dimensions, it's not true that all vector fields are the gradients of a scalar potential; the ones that are, are called 'conservative' vector fields, and satisfy properties like: the path integral between any pair of points depends only on the endpoints, not the path. Is there an equivalent notion of when a 'matrix field' is generated by the Hessians of a scalar function?

So choosing a convex function $\Phi$ is choosing a set of preconditioners. Then mirror descent observes that taking a mirror step

$$\mu' = \mu + \alpha \nabla_x f(x)$$

is equivalent, to first order in $\alpha$, to taking a step preconditioned by the inverse Hessian of $\Phi$:

$$\begin{aligned}x(\mu') &= x\left(\mu + \alpha \nabla_x f(x)\right)\\ &\approx_\text{Taylor} x(\mu) + J_{(\mu \to x)} \left(\mu' - \mu\right)\\ &= x + \alpha J_{(\mu \to \nabla_\mu \Phi^*(\mu))} \nabla_x f(x)\\ &= x + \alpha \nabla^2_\mu \Phi^*(\mu) \nabla_x f(x)\\ &= x + \alpha \nabla^{-2}_x \Phi(x) \nabla_x f(x) \end{aligned}$$

So: mirror descent lets us take steps in the inverse-Hessian-preconditioned space, at the cost only of computing gradients of the mirror function and its conjugate. In that sense it gives us second-order power for first-order cost.

Unfortunately, this can still be costly in cases where the mirror function or its conjugate is something inherently expensive, like an entropy or log-normalizer of a general distribution.
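As a sanity check of the derivation above, here is a small numpy sketch (my own toy setup, not from any particular source) using the negative-entropy mirror function on the positive orthant, where $\nabla\Phi(x) = \log x + 1$, $\nabla\Phi^*(\mu) = \exp(\mu - 1)$, and $\nabla^{-2}\Phi(x) = \operatorname{diag}(x)$. For a small step, the dual-space update matches the inverse-Hessian-preconditioned primal update.

```python
import numpy as np

# Phi(x) = sum_i x_i log x_i on the positive orthant, so that
#   grad Phi(x)   = log(x) + 1       (the mirror map)
#   grad Phi*(m)  = exp(m - 1)       (its inverse)
#   grad^2 Phi(x) = diag(1/x)  =>  inverse Hessian = diag(x)
x = np.array([0.2, 0.5, 0.3])
g = np.array([1.0, -0.5, 0.25])   # stands in for grad_x f(x)
alpha = 1e-3                       # small, so the Taylor approximation is tight

mu = np.log(x) + 1.0
x_mirror = np.exp(mu + alpha * g - 1.0)   # step taken in the mirror (dual) space
x_precond = x + alpha * x * g             # step preconditioned by diag(x)

print(np.max(np.abs(x_mirror - x_precond)))   # O(alpha^2): tiny
```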

  • Exercise: why doesn't this make Newton's method cheaper? Say we have a convex loss $f(x)$ and choose $\Phi(x) = f(x)$. Mapping to the mirror space $\mu = \nabla_x f(x)$ is cheap, and doing the update in that space is cheap. But how do we translate back to the original space? We need the conjugate of $f$, or equivalently the (local) inverse of the gradient map, $(\nabla_x f(x))^{-1}$, and getting that is expensive. In general we can get a local inverse by taking a Taylor approximation of the gradient map, and inverting that---but now we're back to reifying the Hessian and inverting it.

  • From Nick Harvey's notes: A mirror descent algorithm is defined by a mirror function $\Phi: \mathcal{D} \to \mathbb{R}$ where $\mathcal{D} \subseteq \mathbb{R}^n$. The mirror function must be:

    • strictly convex and everywhere differentiable
    • the 'dual space' $\{\nabla \Phi(x) : x \in \mathcal{D}\}$ of its gradients must equal all of $\mathbb{R}^n$
    • the gradient $\nabla \Phi(x)$ must diverge in norm as $x$ approaches the boundary of the domain $\mathcal{D}$
  • The mirror function defines a mirror map $\hat{x} = \nabla \Phi(x)$ into the space of linear functionals. The inverse of this mirror map is given by the gradient of the convex conjugate: $x = \nabla \Phi^*(\hat{x})$.

  • The mirror descent algorithm to optimize an objective $f(x)$ with a mirror function $\Phi$, subject to a constraint $x \in \mathcal{X} \subseteq \mathcal{D}$, is (sketched in code after the steps below):

    1. Compute a subgradient $g_t = \nabla f(x_t)$ at the current point $x_t \in \mathcal{X}$.
    2. Apply the subgradient in the 'mirror' space:
      • consider the current point in mirror space, $\hat{x}_t = \nabla \Phi(x_t)$
      • apply the gradient with a learning rate: $\hat{y}_{t+1} = \hat{x}_t + \alpha_t g_t$
      • move back to the primal space: $y_{t+1} = \nabla \Phi^*(\hat{y}_{t+1})$
    3. Now project back onto the constraint set $\mathcal{X}$ using the Bregman divergence $D_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle x - y, \nabla \Phi(y) \rangle$:
      • $x_{t+1} = \text{argmin}_{x \in \mathcal{X}} D_\Phi(x, y_{t+1})$
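Here is a minimal numpy sketch of one such step, purely illustrative (the function names are mine). Instantiated with the Euclidean mirror function $\Phi(x) = \tfrac12\|x\|^2$ (the conventional $\tfrac12$ just rescales the step), it reduces to projected gradient descent, as the next bullet notes.

```python
import numpy as np

def mirror_descent_step(x, grad_f, grad_phi, grad_phi_star, bregman_project, alpha):
    """One mirror descent step (written here for minimization, so the dual-space
    step subtracts the gradient; the notes above use the ascent sign)."""
    g = grad_f(x)                     # 1. (sub)gradient at the current point
    x_hat = grad_phi(x)               # 2a. current point in the mirror space
    y_hat = x_hat - alpha * g         # 2b. apply the gradient there
    y = grad_phi_star(y_hat)          # 2c. map back to the primal space
    return bregman_project(y)         # 3. Bregman projection onto the constraint set

# Euclidean instance: Phi(x) = 0.5 * ||x||^2, so grad Phi and grad Phi* are both
# the identity and the Bregman projection is the ordinary Euclidean projection.
# This is projected gradient descent, here onto the nonnegative orthant.
c = np.array([1.0, -3.0])
grad_f = lambda x: 2.0 * (x - c)      # f(x) = ||x - c||^2

x = np.zeros(2)
for _ in range(100):
    x = mirror_descent_step(
        x, grad_f,
        grad_phi=lambda z: z,
        grad_phi_star=lambda z: z,
        bregman_project=lambda z: np.maximum(z, 0.0),
        alpha=0.1,
    )
print(x)   # approximately [1, 0], the projection of c onto the orthant
```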
  • If we take the squared Euclidean norm as our mirror function, $\Phi(x) = \|x\|^2_2$, then mirror descent reduces to projected subgradient descent.

  • A common mirror function is negative entropy, $\Phi(x) = \sum_i x_i \log x_i$ for $x \in \mathcal{X} = \Delta^{n}$. In this case we have $D_\Phi(x, y) = KL[x \| y] - \|x\|_1 + \|y\|_1$ (the generalized KL divergence, which is just $KL[x \| y]$ on the simplex), and the 'dual space' has the interpretation of log probabilities. Additive updates in the dual space then become multiplicative updates in the primal space (see the sketch below). Of course, in practice one would probably implement everything in terms of gradient updates on unnormalized logits?
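A small sketch of this case (my own toy objective, exponentiated gradient / multiplicative weights): the mirror-space update is additive in $\log x$, which in the primal space is a multiplicative update followed by renormalization, the Bregman (KL) projection onto the simplex.

```python
import numpy as np

def entropic_step(x, g, alpha):
    y = x * np.exp(-alpha * g)     # additive in log-space == multiplicative here
    return y / y.sum()             # Bregman (KL) projection onto the simplex

# Toy linear objective f(x) = <c, x> minimized over the probability simplex.
c = np.array([0.3, 0.1, 0.7])
x = np.ones(3) / 3
for _ in range(200):
    x = entropic_step(x, c, alpha=0.5)
print(x)   # mass concentrates on the smallest cost, index 1
```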

  • Can I write a mirror descent update in a single line?

    $$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} D_\Phi \left(x, \nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t))\right)$$

    or

    $$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} \left( \Phi(x) - \Phi(\nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t))) - \left \langle x - \nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t) ), \nabla \Phi(\nabla \Phi^*(\nabla \Phi(x_t) + \nabla f(x_t)))\right \rangle \right)$$

or

$$\begin{aligned}\hat{y}_{t+1} &= \nabla \Phi(x_t) + \alpha \nabla f(x_t)\\ y_{t+1} &= \nabla \Phi^*(\hat{y}_{t+1})\\ x_{t+1} &= \text{argmin}_{x\in \mathcal{X}} \left( \Phi(x) - \left \langle x, \hat{y}_{t+1}\right \rangle \right)\end{aligned}$$

By linearity, $\langle x, \hat{y}_{t+1}\rangle = \langle x, \nabla\Phi(x_t)\rangle + \alpha \langle x, \nabla f(x_t)\rangle$. In the 'lazy' (dual-averaging) variant, where the dual iterate just accumulates gradients starting from $\nabla\Phi(x_1) = 0$, this takes the form

$$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} \left( \Phi(x) - \alpha \Big\langle x, \sum_{s \le t} \nabla f(x_{s})\Big\rangle \right)$$
  • This last form implies that the mirror function $\Phi(x)$ functions as a regularization term. We are trying to find the point $x$ in our constraint space that maximally aligns with the 'goal' functional $\hat{y}_{t+1}$ while minimizing $\Phi(x)$. If $\Phi(x) = \|x\|^2_2$, this is $L_2$ regularization; if $\Phi(x) = \sum_i x_i \log x_i$ then this is maximum-entropy regularization.
  • How does mirror descent connect to natural gradient? Let's think back to the variational formulation of natural gradient: we want to find the point that maximizes a linear approximation of $f$ within a KL-ball around our current point:
    $$x_{t+1} - x_t = \text{argmax}_d\; \langle d, \nabla f(x_t)\rangle - \lambda\, d^T F d$$
    where $\lambda$ is a Lagrange multiplier for the KL constraint; setting the gradient in $d$ to zero gives $\nabla f(x_t) = 2\lambda F d$, so the solution is $d^* = \tfrac{1}{2\lambda} F^{-1} \nabla f(x_t) \propto F^{-1} \nabla f(x_t)$.
  • From the Stanford notes, we can consider a general variational problem where we want to take steps measured by the Bregman divergence:
    $$x_{t+1} = \text{argmin}_{x\in \mathcal{X}} D_\Phi(x, x_t) + \langle x - x_t, \nabla f(x_t)\rangle.$$
    • and the special case of negative entropy for $\Phi$ recovers natural gradient.
    • This is apparently also just mirror descent; the first-order condition below makes that explicit.
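    My own quick check, not from the notes: taking the gradient of that objective and setting it to zero recovers exactly the mirror step in the dual space (with the descent sign of this formulation, and ignoring the constraint set):

    $$\nabla_x \left[ D_\Phi(x, x_t) + \langle x - x_t, \nabla f(x_t)\rangle \right] = \nabla \Phi(x) - \nabla \Phi(x_t) + \nabla f(x_t) = 0 \quad\Longrightarrow\quad \nabla \Phi(x_{t+1}) = \nabla \Phi(x_t) - \nabla f(x_t).$$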
  • In what sense is the Bayesian learning rule mirror descent? We are optimizing an objective $L(\lambda) = E_{q_\lambda}[\ell(x)]$ where $q$ is an exponential-family distribution with natural parameter $\lambda$. Now we take the log-normalizer $A(\lambda) = \log \int \exp\langle \lambda, t(x)\rangle\, dx$ as our mirror function. Immediate consequences:
    • The Bregman divergence $D_A(\lambda, \eta) = A(\lambda) - A(\eta) - \langle \lambda - \eta, \nabla A(\eta) \rangle$ is the KL divergence $KL[q_\eta \| q_\lambda]$.
    • The mirror space is that of expectation parameters $\mu$.
    • The mirror descent update would ask us to compute the update $\mu' = \mu + \nabla_\lambda L(\lambda)$. I'm not sure how helpful this is.
  • Let's consider the opposite interpretation. Let $A^*$ be the mirror function. Then the update is $\lambda' = \lambda + \nabla_\mu L(\lambda(\mu)) = \lambda + J_{\lambda\to\mu}^{-1} \nabla_\lambda L(\lambda)$, where
$$\begin{aligned}J_{\lambda\to\mu} &= \nabla_\lambda\, \mu(\lambda)\\ &= \nabla^2_\lambda A(\lambda) \end{aligned}$$

is the Fisher information matrix: because exponential families are log-linear in $\lambda$, the Hessian of the log-likelihood does not depend on $x$, so the Fisher (which would ordinarily be the expected negative Hessian) equals the negative Hessian itself:

$$\begin{aligned}F &= -E_q[\nabla_\lambda^2 \log q_\lambda(x)]\\ &= -E_q\left[\nabla_\lambda^2 \left(\langle \lambda, t(x)\rangle - A(\lambda)\right)\right ]\\ &= -E_q\left[-\nabla_\lambda^2 A(\lambda)\right]\\ &= \nabla_\lambda^2 A(\lambda)\end{aligned}$$

So we conclude that natural gradient on $\lambda$ is mirror descent on $\mu$.
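A tiny numerical check of this conclusion on a toy Bernoulli family (my own example, not from the source): the natural-gradient direction $F^{-1}\nabla_\lambda L$ coincides with the plain gradient $\nabla_\mu L$ in expectation parameters.

```python
import numpy as np

# Bernoulli in natural parameterization: lambda is the logit,
# A(lambda) = log(1 + exp(lambda)), expectation parameter mu = sigmoid(lambda),
# Fisher F = A''(lambda) = mu (1 - mu).
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
ell = np.array([2.0, -1.0])          # loss values ell(x) for x in {0, 1}

lam = 0.7
mu = sigmoid(lam)
F = mu * (1.0 - mu)

# L(lambda) = E_q[ell(x)] = (1 - mu) ell(0) + mu ell(1)
dL_dlam = (ell[1] - ell[0]) * mu * (1.0 - mu)   # chain rule through mu(lambda)
dL_dmu = ell[1] - ell[0]                         # gradient in expectation parameters

print(dL_dmu, dL_dlam / F)   # identical: natural gradient in lambda == gradient in mu
```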

  • Is the converse of this still helpful? Mirror descent in $\lambda$ is natural gradient in $\mu$. It might sometimes be nice in practice to do descent on $\mu$ because it's numerically better-behaved. But we'd have to take gradients with respect to $\lambda$ anyway, which I think defeats the point?