pushforward natural gradient
Created: September 26, 2020
Modified: October 25, 2020


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • It's tempting to use natural gradient ascent to optimize a variational distribution. We could also consider using it to optimize the parameters of a probability model, like a neural net, that describes a predictive distribution on observables. These are different distributions, and the natural gradient depends on which of them we measure our steps against.
  • Suppose we're doing VI over parameters of a predictive probability model. For example, a BNN, or Bayesian inference over the parameters of a time-series model or recommender system, or even a Bayesian linear or logistic regression. How should we optimize the variational distribution?
  • Suppose we stick within the framework of natural gradient. Which distribution do we measure our progress by? It doesn't seem like it makes sense to think solely about the distribution on parameters. We might care more about the pushforward or predictive distribution on observables. That is, given our variational distribution $q_\theta(z)$ and the model $p(x \mid z)$, we consider the variational predictive distribution $\tilde{p}_\theta(x) = \int p(x \mid z)\, q_\theta(z)\, dz$.
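    • Spelling this out (standard definitions, nothing from the paper below): the relevant object would be the Fisher of the predictive distribution, $F(\theta) = \mathbb{E}_{x \sim \tilde{p}_\theta}\!\left[\nabla_\theta \log \tilde{p}_\theta(x)\, \nabla_\theta \log \tilde{p}_\theta(x)^\top\right]$, with the update $\theta_{t+1} = \theta_t + \eta\, F(\theta_t)^{-1} \nabla_\theta \mathcal{L}(\theta_t)$ for whatever objective $\mathcal{L}$ we're ascending (e.g., the ELBO). The usual natural-gradient VI recipe instead preconditions with the Fisher of $q_\theta(z)$ itself.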
  • Some questions arise:
    1. Is there an efficient method for taking steps in $\theta$ following the natural gradient of $\tilde{p}_\theta$? (a rough sketch of what one step would involve follows this list)
    2. Does doing this actually give us a good optimization procedure? (in terms of convergence rates or on practical problems)
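Re question 1, the naive version of a step (ignoring efficiency) would just estimate the Fisher defined above and do a damped solve. A minimal sketch, where `elbo_grad` and `estimate_predictive_fisher` are hypothetical callables I'm assuming the caller supplies, not anything from a paper:

```python
# A minimal sketch of one "pushforward natural gradient" step, assuming the
# caller supplies an ELBO gradient and some estimator of F(theta) above.
# Both callables are hypothetical placeholders.
import jax
import jax.numpy as jnp

def pushforward_natgrad_step(theta, key, elbo_grad, estimate_predictive_fisher,
                             lr=1e-2, damping=1e-3):
    k_grad, k_fisher = jax.random.split(key)
    g = elbo_grad(theta, k_grad)                      # stochastic ascent direction
    F = estimate_predictive_fisher(theta, k_fisher)   # Fisher of p~_theta, shape (P, P)
    # damped solve in place of an exact inverse; this is cubic in dim(theta),
    # so anything beyond a toy model would need a structured approximation (e.g. KFAC)
    natgrad = jnp.linalg.solve(F + damping * jnp.eye(theta.shape[0]), g)
    return theta + lr * natgrad
```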
  • Update: it looks like this is exactly what https://arxiv.org/abs/1903.02984 explores. I need to read it. Some questions:
    • can they approximate the Fisher effectively?
      • it looks like they sample q and p(x | z) with reparameterization gradients to compute the covariance-of-gradients approximation (a rough sketch of my guess at that estimator appears after these bullets). so this works 'generically', although you do need to sample from the likelihood. I wonder how/if the empirical Fisher would work here.
      • this also involves materializing the Fisher matrix---they use KFAC for a VAE experiment.
    • where does it actually work better? by how much?
      • looks like it works in cases where the likelihood has strong curvature.
    • would this work for BNNs? yeah, but you need something like KFAC to approximate a Fisher matrix.
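My guess at what that covariance-of-gradients estimator looks like, for a diagonal-Gaussian $q_\theta$ over $z$. The names `log_lik(x, z)` and `sample_x(z, key)` are stand-ins I'm inventing for the model $p(x \mid z)$, and whether the paper differentiates exactly this single-sample quantity (rather than estimating $\nabla_\theta \log \tilde{p}_\theta(x)$ more directly) is something to check when I actually read it:

```python
# Sketch of a sampled covariance-of-gradients Fisher approximation, assuming a
# diagonal-Gaussian q_theta(z) and reparameterized z. `log_lik(x, z)` and
# `sample_x(z, key)` are placeholders standing in for the model p(x | z).
import jax
import jax.numpy as jnp

def sample_z(theta, eps):
    # theta packs (mu, log_sigma); reparameterized sample z ~ q_theta
    d = theta.shape[0] // 2
    mu, log_sigma = theta[:d], theta[d:]
    return mu + jnp.exp(log_sigma) * eps

def fisher_estimate(theta, key, log_lik, sample_x, num_samples=64):
    dim_z = theta.shape[0] // 2
    keys = jax.random.split(key, num_samples)

    def per_sample_outer(k):
        k_eps, k_x = jax.random.split(k)
        eps = jax.random.normal(k_eps, (dim_z,))
        x = sample_x(sample_z(theta, eps), k_x)   # x ~ p(x | z), z ~ q_theta
        # gradient through the reparameterized z, holding the sampled x fixed
        g = jax.grad(lambda t: log_lik(x, sample_z(t, eps)))(theta)
        return jnp.outer(g, g)

    return jnp.mean(jax.vmap(per_sample_outer)(keys), axis=0)
```

This materializes the full matrix, which only makes sense for small theta; hence the KFAC point above.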
  • Second update: a compositional implementation of this is also closely related to work on Hessian-free and generalized Gauss-Newton (GGN) optimization for neural nets.
  • NameRedacted says: maybe think about the chain rule for entropy, H(X, Y) = H(X) + H(Y | X)
  • Some things we might naturally try:
    • apply the relevant Fisher preconditioner at each stage: 'compositional natural gradient' (a GGN-style sketch of the composed Fisher appears at the end of these notes)
      • a meta-question is, how important is compositionality here? specifically:
        • when would we compose more than two stages?
          • we could have multiple stochastic layers
        • does this let us get natural-gradient behavior 'for free' by building a model from natural-gradient parts?
          • specifically, we would need a 'likelihood part' such that gradients of log p(x | z) get turned into natural gradients (presumably assuming x is sampled given z; otherwise we'd be doing empirical natural gradients).
          • it's easy to imagine this for simple distributions. But what if there's a linear link (as in linear/logistic regression) or a nonlinear link? what if we have multiple observed variables?
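One concrete instance of 'building the natural gradient from parts', and also of the GGN connection above: when the observable's distribution has parameters produced by a deterministic stage, x ~ p(x | f(w, inputs)), the Fisher with respect to w factors as E[J^T F_out J], where J is the Jacobian of f and F_out is the Fisher of the output distribution in its output parameterization. A minimal JAX sketch of the corresponding matrix-vector product (the function names are mine, and `output_fisher_vp` is a placeholder, e.g. v -> v / sigma**2 for a fixed-variance Gaussian likelihood):

```python
# Sketch of a Fisher/GGN matrix-vector product for a composed model
# x ~ p(x | f(w, inputs)): F = E[ J^T F_out J ], where J is the Jacobian of the
# deterministic stage f and F_out is the Fisher of the output distribution in
# its output parameterization. `output_fisher_vp` applies F_out to a vector.
import jax

def ggn_vector_product(f, output_fisher_vp, w, inputs, v):
    push = lambda w_: f(w_, inputs)
    _, Jv = jax.jvp(push, (w,), (v,))       # forward-mode: J v
    FJv = output_fisher_vp(Jv)              # middle factor: F_out (J v)
    _, pullback = jax.vjp(push, w)
    (JtFJv,) = pullback(FJv)                # reverse-mode: J^T F_out J v
    return JtFJv
```

This only handles deterministic intermediate stages; what the right middle factor is when intermediate layers are themselves stochastic is exactly the compositionality question above.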