noisy natural gradient as VI: Nonlinear Function
Created: October 30, 2020
Modified: October 30, 2020

noisy natural gradient as VI

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • https://arxiv.org/abs/1712.02390 — Zhang, Sun, Duvenaud, Grosse, "Noisy Natural Gradient as Variational Inference"
  • Basic idea: optimizers like Adam and RMSProp already keep running second-moment estimates of the gradients, i.e. rough curvature estimates, which can be read as the precisions of diagonal Gaussian posteriors. Can we make that connection precise? (A numerical sketch of this reading is below the list.)
  • Handwavy takeaway: in the case of a Gaussian surrogate posterior, consider the
    • log likelihood $\log p(y | w, x)$ with Fisher $-E_y[\nabla^2_w \log p(y | w, x)]$
    • That Fisher measures the curvature of the true posterior (ignoring any prior??) $p(w | y, x)$.
    • Let $q_\theta(w)$ be a Gaussian surrogate. If $q$ is a good approximation to the true posterior, then the likelihood Fisher is also a good approximation to $q$'s precision parameter. And $q$-style natural gradient on $\theta$ means preconditioning the mean update with that precision, which is what we would have done anyway if we'd preconditioned with the likelihood Fisher.
    • Does this imply that compositional natural gradient is bad for posteriors with curvature? Preconditioning twice by the Fisher wouldn't be good.
      • Suppose the true posterior $p(w | x)$ is $\text{Normal}(\mu, \Sigma)$. We'd get that if the likelihood were just $x \sim \text{Normal}(w, \Sigma)$ with a flat prior on $w$, so that $\mu = x$. Now we want to learn a surrogate posterior $q(w | x) = \text{Normal}(\mu_q, \Sigma_q)$.
      • The Fisher of our 'loss' $-\log p(x | w)$ is $\Sigma^{-1}$. So it's reasonable to consider the immediate natural gradient
        $F^{-1} \nabla_w \left[-\log p(x | w)\right] = \Sigma \nabla_w \tfrac{1}{2}(w - x)^T \Sigma^{-1}(w - x) = w - x$.
        If we then suppose that $w \sim q(w; \mu_q, \Sigma_q)$ is a posterior sample given by the reparameterisation $w = L(\Sigma_q) z + \mu_q$ with $z \sim \text{Normal}(0, I)$, and try to do natural gradient updates on $\mu_q$ and $\Sigma_q$, what happens? We have $\frac{\partial w}{\partial \mu_q} = I$, so the preconditioned gradient for $\mu_q$ is $\Sigma_q (w - x)$, i.e. we end up preconditioning by $\Sigma_q \Sigma \approx \Sigma^2$ near the optimum. That's not great (see the numerical sketch after this list).
      • How does this relate to the Pushforward natural gradient? There they consider the 'variational predictive distribution' $p(x | \mu_q, \Sigma_q) = \int q(w; \mu_q, \Sigma_q)\, p(x | w)\, dw$, which is $\text{Normal}(\mu_q, \Sigma + \Sigma_q)$. In the case where we are close to the true posterior, $\Sigma_q \approx \Sigma$, this predictive covariance is just $2\Sigma$, so using it as a preconditioning Fisher avoids the issue above (checked numerically after this list).
      • The crucial thing is that the pushforward relationship of Gaussian covariances is addition, not any kind of multiplication. :-(
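
To make the "optimizers already track curvature" point a bit more concrete, here is a small sketch (my own illustration, not from the paper): for a Gaussian observation model, the exponential moving average of squared gradients that RMSProp/Adam maintain converges to the diagonal Fisher, i.e. the precision of a diagonal Gaussian posterior approximation. The model and numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observation model: y ~ Normal(w, diag(sigma**2)).
# The per-coordinate Fisher w.r.t. w is 1 / sigma**2.
sigma = np.array([0.5, 1.0, 2.0])
w = np.zeros(3)        # evaluate at the true mean, so gradients are pure score noise
v = np.zeros(3)        # RMSProp-style second-moment accumulator
beta = 0.99

for _ in range(20_000):
    y = rng.normal(w, sigma)              # draw one observation
    g = (y - w) / sigma**2                # score: d/dw log p(y | w)
    v = beta * v + (1 - beta) * g**2      # EMA of squared gradients, as in RMSProp/Adam

print("second-moment accumulator v:", v)
print("diagonal Fisher 1/sigma^2  :", 1 / sigma**2)
# v approaches the diagonal Fisher, which is the precision a diagonal Gaussian
# posterior approximation would use (up to prior / dataset-size terms).
```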
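And a numerical version of the "preconditioning twice" worry, with made-up $\Sigma$, $\Sigma_q$, $x$: composing the likelihood-Fisher natural gradient in $w$ with a further $\Sigma_q$ preconditioning of the $\mu_q$ update gives a step scaled by $\Sigma_q \Sigma$, whereas preconditioning the plain gradient once stays sensibly scaled.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up problem: likelihood x ~ Normal(w, Sigma), flat prior, so the true
# posterior over w is Normal(x, Sigma). Surrogate q(w) = Normal(mu_q, Sigma_q).
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
Sigma_q = Sigma.copy()               # pretend q has already matched the covariance
x = rng.normal(size=3)
mu_q = rng.normal(size=3)

# Reparameterised posterior sample w = L z + mu_q.
L = np.linalg.cholesky(Sigma_q)
z = rng.normal(size=3)
w = L @ z + mu_q

grad_w_loss = np.linalg.solve(Sigma, w - x)   # grad of -log p(x | w) = Sigma^{-1} (w - x)
immediate_natgrad = Sigma @ grad_w_loss       # likelihood-Fisher natural gradient = w - x

# dw/dmu_q = I, then precondition again with the q-Fisher inverse Sigma_q:
double_step = Sigma_q @ immediate_natgrad     # = Sigma_q (w - x), i.e. Sigma_q Sigma grad

# For contrast: precondition the *plain* gradient once.
single_step = Sigma_q @ grad_w_loss           # = w - x here, since Sigma_q = Sigma

print("preconditioned twice:", double_step)
print("preconditioned once :", single_step)
print("w - x               :", w - x)
```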
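Finally, a quick Monte Carlo check (again with made-up numbers) that the variational predictive distribution really does have covariance $\Sigma + \Sigma_q$, so that using it as the preconditioner only rescales the mean update by roughly a factor of two rather than squaring the covariance.

```python
import numpy as np

rng = np.random.default_rng(2)

A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)      # likelihood covariance: x | w ~ Normal(w, Sigma)
Sigma_q = Sigma.copy()               # q(w) = Normal(mu_q, Sigma_q), close to the truth
mu_q = rng.normal(size=3)
x = rng.normal(size=3)

# Monte Carlo check: int q(w) p(x | w) dw = Normal(mu_q, Sigma + Sigma_q).
n = 100_000
ws = rng.multivariate_normal(mu_q, Sigma_q, size=n)
xs = ws + rng.multivariate_normal(np.zeros(3), Sigma, size=n)
print("empirical predictive covariance:\n", np.cov(xs.T))
print("Sigma + Sigma_q:\n", Sigma + Sigma_q)

# Preconditioning the mu_q update with the predictive covariance (= 2 Sigma here)
# gives a step toward x scaled once, not a Sigma^2-scaled step.
grad_mu = np.linalg.solve(Sigma + Sigma_q, mu_q - x)  # grad of -log Normal(x; mu_q, Sigma+Sigma_q)
pushforward_step = (Sigma + Sigma_q) @ grad_mu        # = mu_q - x
print("pushforward-preconditioned step:", pushforward_step)
print("mu_q - x                       :", mu_q - x)
```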