noisy natural gradient as VI: Nonlinear Function
Created: October 30, 2020
Modified: October 30, 2020

noisy natural gradient as VI

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • https://arxiv.org/abs/1712.02390 — Zhang, Sun, Duvenaud, Grosse, "Noisy Natural Gradient as Variational Inference"
  • Basic idea: optimizers like Adam and RMSProp already keep running second-moment estimates of the gradients, i.e. rough curvature estimates, which can be read as the precisions of diagonal Gaussian posteriors. Can we make that connection precise? (A numerical sketch of this reading is below the list.)
  • Handwavy takeaway: in the case of a Gaussian surrogate posterior, consider the
    • log likelihood $\log p(y | w, x)$ with Fisher $-E_y[\nabla^2_w \log p(y | w, x)]$
    • That Fisher measures the curvature of the true posterior (ignoring any prior??) $p(w | y, x)$.
    • Let $q_\theta(w)$ be a Gaussian surrogate. If $q$ is a good approximation to the true posterior, then the likelihood Fisher is also a good approximation to $q$'s precision parameter. And $q$-style natural gradient on $\theta$ means preconditioning the mean update with that precision, which is what we would have done anyway if we'd preconditioned with the likelihood Fisher.
    • Does this imply that compositional natural gradient is bad for posteriors with curvature? Preconditioning twice by the Fisher wouldn't be good.
      • Suppose the true posterior $p(w | x)$ is $\text{Normal}(\mu, \Sigma)$. We'd get that if the likelihood were just $x \sim \text{Normal}(w, \Sigma)$ with a flat prior on $w$, so that $\mu = x$. Now we want to learn a surrogate posterior $q(w | x) = \text{Normal}(\mu_q, \Sigma_q)$.
      • The Fisher of our 'loss' $-\log p(x | w)$ is $\Sigma^{-1}$. So it's reasonable to consider the immediate natural gradient
        $F^{-1} \nabla_w \left[-\log p(x | w)\right] = \Sigma \nabla_w \tfrac{1}{2}(w - x)^T \Sigma^{-1}(w - x) = w - x$.
        If we then suppose that $w \sim q(w; \mu_q, \Sigma_q)$ is a posterior sample given by the reparameterisation $w = L(\Sigma_q) z + \mu_q$ with $z \sim \text{Normal}(0, I)$, and try to do natural gradient updates on $\mu_q$ and $\Sigma_q$, what happens? We have $\frac{\partial w}{\partial \mu_q} = I$, so the preconditioned gradient for $\mu_q$ is $\Sigma_q (w - x)$, i.e. we end up preconditioning by $\Sigma_q \Sigma \approx \Sigma^2$ near the optimum. That's not great (see the numerical sketch after this list).
      • How does this relate to the Pushforward natural gradient? There they consider the 'variational predictive distribution' $p(x | \mu_q, \Sigma_q) = \int q(w; \mu_q, \Sigma_q)\, p(x | w)\, dw$, which is $\text{Normal}(\mu_q, \Sigma + \Sigma_q)$. In the case where we are close to the true posterior, $\Sigma_q \approx \Sigma$, this predictive covariance is just $2\Sigma$, so using it as a preconditioning Fisher avoids the issue above (checked numerically after this list).
      • The crucial thing is that the pushforward relationship of Gaussian covariances is addition, not any kind of multiplication. :-(
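
To make the "optimizers already track curvature" point a bit more concrete, here is a small sketch (my own illustration, not from the paper): for a Gaussian observation model, the exponential moving average of squared gradients that RMSProp/Adam maintain converges to the diagonal Fisher, i.e. the precision of a diagonal Gaussian posterior approximation. The model and numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observation model: y ~ Normal(w, diag(sigma**2)).
# The per-coordinate Fisher w.r.t. w is 1 / sigma**2.
sigma = np.array([0.5, 1.0, 2.0])
w = np.zeros(3)        # evaluate at the true mean, so gradients are pure score noise
v = np.zeros(3)        # RMSProp-style second-moment accumulator
beta = 0.99

for _ in range(20_000):
    y = rng.normal(w, sigma)              # draw one observation
    g = (y - w) / sigma**2                # score: d/dw log p(y | w)
    v = beta * v + (1 - beta) * g**2      # EMA of squared gradients, as in RMSProp/Adam

print("second-moment accumulator v:", v)
print("diagonal Fisher 1/sigma^2  :", 1 / sigma**2)
# v approaches the diagonal Fisher, which is the precision a diagonal Gaussian
# posterior approximation would use (up to prior / dataset-size terms).
```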
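And a numerical version of the "preconditioning twice" worry, with made-up $\Sigma$, $\Sigma_q$, $x$: composing the likelihood-Fisher natural gradient in $w$ with a further $\Sigma_q$ preconditioning of the $\mu_q$ update gives a step scaled by $\Sigma_q \Sigma$, whereas preconditioning the plain gradient once stays sensibly scaled.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up problem: likelihood x ~ Normal(w, Sigma), flat prior, so the true
# posterior over w is Normal(x, Sigma). Surrogate q(w) = Normal(mu_q, Sigma_q).
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
Sigma_q = Sigma.copy()               # pretend q has already matched the covariance
x = rng.normal(size=3)
mu_q = rng.normal(size=3)

# Reparameterised posterior sample w = L z + mu_q.
L = np.linalg.cholesky(Sigma_q)
z = rng.normal(size=3)
w = L @ z + mu_q

grad_w_loss = np.linalg.solve(Sigma, w - x)   # grad of -log p(x | w) = Sigma^{-1} (w - x)
immediate_natgrad = Sigma @ grad_w_loss       # likelihood-Fisher natural gradient = w - x

# dw/dmu_q = I, then precondition again with the q-Fisher inverse Sigma_q:
double_step = Sigma_q @ immediate_natgrad     # = Sigma_q (w - x), i.e. Sigma_q Sigma grad

# For contrast: precondition the *plain* gradient once.
single_step = Sigma_q @ grad_w_loss           # = w - x here, since Sigma_q = Sigma

print("preconditioned twice:", double_step)
print("preconditioned once :", single_step)
print("w - x               :", w - x)
```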
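Finally, a quick Monte Carlo check (again with made-up numbers) that the variational predictive distribution really does have covariance $\Sigma + \Sigma_q$, so that using it as the preconditioner only rescales the mean update by roughly a factor of two rather than squaring the covariance.

```python
import numpy as np

rng = np.random.default_rng(2)

A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)      # likelihood covariance: x | w ~ Normal(w, Sigma)
Sigma_q = Sigma.copy()               # q(w) = Normal(mu_q, Sigma_q), close to the truth
mu_q = rng.normal(size=3)
x = rng.normal(size=3)

# Monte Carlo check: int q(w) p(x | w) dw = Normal(mu_q, Sigma + Sigma_q).
n = 100_000
ws = rng.multivariate_normal(mu_q, Sigma_q, size=n)
xs = ws + rng.multivariate_normal(np.zeros(3), Sigma, size=n)
print("empirical predictive covariance:\n", np.cov(xs.T))
print("Sigma + Sigma_q:\n", Sigma + Sigma_q)

# Preconditioning the mu_q update with the predictive covariance (= 2 Sigma here)
# gives a step toward x scaled once, not a Sigma^2-scaled step.
grad_mu = np.linalg.solve(Sigma + Sigma_q, mu_q - x)  # grad of -log Normal(x; mu_q, Sigma+Sigma_q)
pushforward_step = (Sigma + Sigma_q) @ grad_mu        # = mu_q - x
print("pushforward-preconditioned step:", pushforward_step)
print("mu_q - x                       :", mu_q - x)
```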