mirror descent implementations
Created: September 07, 2020
Modified: September 07, 2020

mirror descent implementations

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • What pieces of mirror descent can we automate?
  • Given a mirror function $\Phi$, we can compute the mirror map $\nabla \Phi$ and the Bregman divergence $D_\Phi(x, y)$.
  • Can we compute the convex conjugate $\Phi^*(y) = \sup_x \langle y, x \rangle - \Phi(x)$? Of course you could optimize, but I don't think we could get an exact version.
  • Can we do Bregman projection $\text{argmin}_{x\in X} D_\Phi(x, y)$? Again, of course we can optimize, but we're not going to get this in closed form for arbitrary $X$ and $\Phi$.
  • How specifically does exponential-family mirror descent work?
    • Hypothetically we start with a mean param $\mu_t$ and dual $\lambda_t$. We compute $g_t = \nabla_\mu L(\lambda(\mu))$ and then apply the dual update $\lambda_{t+1} = \lambda_t + \alpha g_t$. We then convert this back to
      $$\begin{aligned}\mu_{t+1} &= \text{argmin}_{\mu\in \mathcal{M}} D_{A^*}(\mu, \nabla A(\lambda_{t+1}))\\ &= \text{argmin}_{\mu\in \mathcal{M}} A^*(\mu) - \langle \mu, \nabla A^*(\nabla A(\lambda_{t+1}))\rangle\\ &= \text{argmin}_{\mu\in \mathcal{M}} A^*(\mu) - \langle \mu, \lambda_{t+1} \rangle,\end{aligned}$$
      where the second line drops the terms of the Bregman divergence that are constant in $\mu$. This minimizes the Bregman divergence to the desired natural parameter $\lambda_{t+1}$ within the space of realizable marginals.
    • It seems like this assumes that our natural param $\lambda_{t+1}$ was 'valid'? But in general we can get a natural param that doesn't normalize. Is that okay? It seems like the update above would still be well defined.
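The first automatable piece above, getting $D_\Phi$ from $\Phi$ alone, can be sketched with a numerical gradient standing in for an autodiff mirror map. The negative-entropy example (where $D_\Phi$ becomes KL divergence on the simplex) is my own illustration:

```python
import numpy as np

def numerical_grad(phi, x, eps=1e-6):
    # Central-difference approximation of the mirror map grad(Phi);
    # an autodiff framework would give this exactly.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (phi(x + e) - phi(x - e)) / (2 * eps)
    return g

def bregman(phi, x, y):
    # D_Phi(x, y) = Phi(x) - Phi(y) - <grad(Phi)(y), x - y>
    return phi(x) - phi(y) - numerical_grad(phi, y) @ (x - y)

# Negative entropy Phi(x) = sum_i x_i log x_i makes D_Phi the KL
# divergence for points on the probability simplex.
neg_entropy = lambda x: float(np.sum(x * np.log(x)))
x = np.array([0.2, 0.8])
y = np.array([0.5, 0.5])
kl = float(np.sum(x * np.log(x / y)))
assert abs(bregman(neg_entropy, x, y) - kl) < 1e-5
```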
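The "of course you could optimize" route for the conjugate can be made concrete with plain gradient ascent on $\langle y, x \rangle - \Phi(x)$. This is only a numerical stand-in, not an exact conjugate; the quadratic sanity check (where $\Phi$ is self-conjugate) is my choice:

```python
import numpy as np

def conjugate_approx(phi, grad_phi, y, x0, lr=0.1, steps=500):
    # Approximate Phi*(y) = sup_x <y, x> - Phi(x) by gradient ascent.
    x = x0.astype(float).copy()
    for _ in range(steps):
        x += lr * (y - grad_phi(x))  # gradient of <y, x> - Phi(x)
    return y @ x - phi(x)

# Sanity check: Phi(x) = 0.5 ||x||^2 is self-conjugate, so
# Phi*(y) = 0.5 ||y||^2 and the supremum is attained at x = y.
phi = lambda x: 0.5 * float(x @ x)
grad_phi = lambda x: x
y = np.array([1.0, -2.0])
val = conjugate_approx(phi, grad_phi, y, np.zeros(2))
assert abs(val - 0.5 * float(y @ y)) < 1e-6
```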
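While the Bregman projection has no closed form for arbitrary $X$ and $\Phi$, special cases do: with negative entropy and $X$ the probability simplex, projecting a positive vector is just renormalization. A spot-check of that claim (the specific numbers are arbitrary):

```python
import numpy as np

def d_phi(x, y):
    # Bregman divergence of Phi(x) = sum_i x_i log x_i:
    # D(x, y) = sum_i x_i log(x_i / y_i) - sum_i x_i + sum_i y_i
    return float(np.sum(x * np.log(x / y)) - x.sum() + y.sum())

y = np.array([0.3, 0.6, 0.3])   # positive but not normalized
proj = y / y.sum()              # claimed closed-form projection onto the simplex

# Verify optimality against random points on the simplex.
rng = np.random.default_rng(0)
for _ in range(100):
    cand = rng.dirichlet(np.ones(3))
    assert d_phi(proj, y) <= d_phi(cand, y) + 1e-9
```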
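The exponential-family loop above can be sketched for a categorical model, where $A(\lambda) = \log \sum_i e^{\lambda_i}$ so that $\mu_t = \nabla A(\lambda_t) = \mathrm{softmax}(\lambda_t)$ and the conversion back to a realizable mean parameter is handled by the link itself. The quadratic loss and the descent sign ($\lambda_{t+1} = \lambda_t - \alpha g_t$, since we minimize here) are my assumptions:

```python
import numpy as np

def softmax(lam):
    e = np.exp(lam - lam.max())
    return e / e.sum()

# Hypothetical objective: L(mu) = 0.5 ||mu - target||^2 over the simplex,
# minimized at mu = target.
target = np.array([0.7, 0.2, 0.1])
lam = np.zeros(3)                  # natural parameter lambda_0
alpha = 0.5
for _ in range(2000):
    mu = softmax(lam)              # mean param mu_t = grad A(lambda_t)
    g = mu - target                # g_t = grad_mu L(mu_t)
    lam = lam - alpha * g          # dual update in natural-parameter space
assert np.allclose(softmax(lam), target, atol=1e-3)
```

Note that every natural parameter is 'valid' in this unconstrained example: any $\lambda$ normalizes through the log-partition $A$, which is one answer to the question above for this family.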