Created: September 26, 2020
Modified: October 30, 2020
compositional natural gradient
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
- TO READ:
- Kronecker-factored Approximate Curvature (K-FAC)
- Practical Gauss-Newton: gives recursions for computing Hessian blocks and generalized Gauss-Newton blocks.
- Higher-order autodiff: gives a general recursion for computing Hessians.
- ORIGINAL WRITEUP:
- The general idea is to implement something like pushforward natural gradient efficiently and automatically.
- When optimizing a variational bound, I might want to precondition by the Fisher of the predictive distribution $p(x \mid \phi) = \mathbb{E}_{q_\phi(z)}[p(x \mid z)]$, that is, $F_\phi = \mathbb{E}_{p(x\mid\phi)}\left[\nabla_\phi \log p(x\mid\phi)\, \nabla_\phi \log p(x\mid\phi)^T\right]$, or the expected Hessian of the conditional term in the ELBO: $\mathbb{E}_\varepsilon\left[\nabla^2_\phi \left(-\log p(x \mid z(\phi, \varepsilon))\right)\right]$, where $z(\phi, \varepsilon) = g(\phi, \varepsilon)$ is a reparameterized sample of $q_\phi(z)$.
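- A minimal JAX sketch of the second option (my own toy setup, not from the original note: a diagonal-Gaussian $q_\phi$ with $\phi = [\text{mean}, \log\text{std}]$ and a unit-variance Gaussian likelihood), estimating the expected Hessian of the conditional term by Monte Carlo over $\varepsilon$:

```python
import jax
import jax.numpy as jnp

x = jnp.array([1.0, -0.5])                       # toy observation

def sample_z(phi, eps):
    # reparameterized sample of q_phi(z) = N(mean, diag(std^2)), phi = [mean, log_std]
    mean, log_std = phi[:2], phi[2:]
    return mean + jnp.exp(log_std) * eps

def conditional_term(phi, eps):
    # -log p(x | z(phi, eps)) for a unit-variance Gaussian likelihood (up to a constant)
    z = sample_z(phi, eps)
    return 0.5 * jnp.sum((x - z) ** 2)

def expected_conditional_hessian(phi, key, n_samples=256):
    # Monte Carlo estimate of E_eps[ hessian_phi( -log p(x | z(phi, eps)) ) ]
    eps = jax.random.normal(key, (n_samples,) + x.shape)
    hessians = jax.vmap(jax.hessian(conditional_term), in_axes=(None, 0))(phi, eps)
    return hessians.mean(axis=0)

phi = jnp.zeros(4)                                # mean = 0, log_std = 0
H = expected_conditional_hessian(phi, jax.random.PRNGKey(0))
print(H.shape)                                    # (4, 4)
```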
- We are ultimately optimizing some loss $\mathcal{L}$ that depends on $\phi$ only through the reparameterized sample $z$ (and, for the full negative ELBO, directly through $\log q_\phi(z)$). Suppose we're in a Monte Carlo setting where we compute a stochastic gradient by:
- sampling $z = g(\phi, \varepsilon)$ (where $g$ reparameterizes $q_\phi$ in terms of standard randomness $\varepsilon$)
- computing $\mathcal{L} = \log q_\phi(z) - \log p(x, z)$, a stochastic estimate of the negative ELBO.
- Now the gradient with respect to $\phi$ is of the form $\nabla_\phi \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \phi} + \left(\frac{\partial z}{\partial \phi}\right)^T \frac{\partial \mathcal{L}}{\partial z}$, where the first term is the direct dependence through $\log q_\phi(z)$ and the second is the path through the sample
- (where each derivative operator is really lazy notation for a vector-Jacobian product)
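- Here is a minimal JAX sketch of that gradient as explicit VJP/grad calls (a toy 1-D Gaussian model of my own, not from the note: $q_\phi(z) = N(\mu, \sigma^2)$ with $\phi = (\mu, \log\sigma)$, prior $N(0,1)$, likelihood $N(z,1)$):

```python
import jax
import jax.numpy as jnp

x = jnp.array(1.5)                               # toy observation

def sample_z(phi, eps):                          # z = g(phi, eps)
    mu, log_sigma = phi[0], phi[1]
    return mu + jnp.exp(log_sigma) * eps

def log_q(phi, z):                               # log q_phi(z) = log N(z; mu, sigma^2)
    mu, log_sigma = phi[0], phi[1]
    return -0.5 * ((z - mu) / jnp.exp(log_sigma)) ** 2 - log_sigma - 0.5 * jnp.log(2.0 * jnp.pi)

def log_joint(z):                                # log p(x, z), up to constants
    return -0.5 * (x - z) ** 2 - 0.5 * z ** 2    # N(z, 1) likelihood, N(0, 1) prior

def neg_elbo_grad(phi, eps):
    # path through the sample: (dz/dphi)^T dL/dz, written as explicit VJP/grad calls
    z, vjp_g = jax.vjp(lambda p: sample_z(p, eps), phi)
    dL_dz = jax.grad(lambda z_: log_q(phi, z_) - log_joint(z_))(z)
    (path_term,) = vjp_g(dL_dz)
    # direct dependence of log q_phi(z) on phi, with z held fixed (the score term)
    direct_term = jax.grad(log_q)(phi, z)
    return path_term + direct_term

phi = jnp.array([0.0, 0.0])                      # (mu, log sigma)
eps = jax.random.normal(jax.random.PRNGKey(0))
print(neg_elbo_grad(phi, eps))
```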
- The 'at each stage' proposal is to replace each $\frac{\partial \mathcal{L}}{\partial z}$ with the natural gradient $F_{p(x|z)}^{-1}\frac{\partial \mathcal{L}}{\partial z}$ relative to $p(x \mid z)$, and to replace each pullback through $\frac{\partial z}{\partial \phi}$ with the natural-gradient version wrt $q_\phi$, i.e., apply $F_{q_\phi}^{-1}$ after the vector-Jacobian product.
- Does this make any sense? In particular, is it equivalent to preconditioning by the Fisher of the pushforward distribution at the last step? (which is, by assumption, a reasonable thing to do).
- Let's simplify by ignoring the direct $\frac{\partial \mathcal{L}}{\partial \phi}$ (score) term, i.e., we'll use the 'sticking the landing' estimator. Then the question is: how does $F_{q_\phi}^{-1}\left(\frac{\partial z}{\partial \phi}\right)^T F_{p(x|z)}^{-1}\frac{\partial \mathcal{L}}{\partial z}$, the transformation of a gradient through the two individual Fisher preconditioners sandwiched around the Jacobian of the reparameterized sample, compare to the 'pushforward' natural gradient $F_{p(x|\phi)}^{-1}\nabla_\phi \mathcal{L}$?
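- A sketch of the sandwiched preconditioner in the same toy 1-D Gaussian setup (my notation and assumptions; both Fishers happen to be closed-form here, and only the likelihood term of $\mathcal{L}$ is kept in $z$-space for simplicity):

```python
import jax
import jax.numpy as jnp

x = jnp.array(1.5)

def sample_z(phi, eps):
    mu, log_sigma = phi[0], phi[1]
    return mu + jnp.exp(log_sigma) * eps

def neg_log_lik(z):
    # -log p(x|z) for a unit-variance Gaussian likelihood (up to a constant)
    return 0.5 * (x - z) ** 2

def compositional_nat_grad(phi, eps):
    # sticking-the-landing: only the path through z, no direct d/dphi log q term
    z, vjp_g = jax.vjp(lambda p: sample_z(p, eps), phi)
    g_z = jax.grad(neg_log_lik)(z)
    # Fisher of p(x|z) wrt z is 1 for a unit-variance Gaussian, so the z-space
    # natural gradient equals g_z here; in general this is F_z^{-1} g_z.
    nat_g_z = g_z / 1.0
    (g_phi,) = vjp_g(nat_g_z)                  # pull back through dz/dphi
    # Fisher of q_phi = N(mu, sigma^2) wrt (mu, log sigma) is diag(1/sigma^2, 2)
    sigma2 = jnp.exp(2.0 * phi[1])
    F_q_diag = jnp.array([1.0 / sigma2, 2.0])
    return g_phi / F_q_diag                    # F_q^{-1} J^T F_z^{-1} dL/dz

phi = jnp.array([0.0, 0.0])                    # (mu, log sigma)
eps = jax.random.normal(jax.random.PRNGKey(0))
print(compositional_nat_grad(phi, eps))
```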
- See Gaussian case at https://colab.research.google.com/drive/1GskCqQE3wFXlbFUobYnhOxo23W-6AiMi#scrollTo=P6q8gGm25iiR . It turns out that this can even be better than the pushforward, since the latter puts the expectation over z inside the log, while the ELBO (which we are minimizing) keeps it outside the log, so the pushforward Fisher matches the curvature of the marginal log-likelihood rather than that of the ELBO.
- COMPOSITIONAL HESSIANS: We might first ask how to compute, or precondition by, the quantity with all expectations removed: the per-sample Hessian $\nabla^2_\phi \mathcal{L}(\phi, z(\phi, \varepsilon))$.
- In general we won't be able to work with the whole thing, but it may be tractable to approximate it as block-diagonal.
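- A toy JAX sketch of the block-diagonal idea (illustrative parameter groups of my own choosing; in practice one would estimate the blocks rather than materialize the full Hessian):

```python
import jax
import jax.numpy as jnp

def loss(params):
    # tiny made-up loss over two parameter groups
    w, b = params["w"], params["b"]
    preds = jnp.tanh(w) @ jnp.ones((3, 2)) + b
    return jnp.sum(preds ** 2)

params = {"w": jnp.array([0.1, -0.2, 0.3]), "b": jnp.zeros(2)}
grads = jax.grad(loss)(params)
hess = jax.hessian(loss)(params)               # nested dict of Hessian blocks

# keep only the diagonal blocks and take a damped Newton-like step per group
damping = 1e-3
step = {
    k: -jnp.linalg.solve(hess[k][k] + damping * jnp.eye(grads[k].size), grads[k])
    for k in params
}
print(step)
```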
- The K-FAC approach is, roughly, to approximate the Hessian as block-(tri)diagonal, where each block is a Kronecker product of factors, and these block matrices can be efficiently inverted. Computationally, we would compute the full gradient, use it to update the terms in our Hessian approximation, invert the blocks efficiently, and use the result to precondition the gradient. In the simplest block-diagonal case, each block $F_i \approx A_i \otimes G_i$ takes a step $\Delta W_i = -\eta\, G_i^{-1} (\nabla_{W_i}\mathcal{L})\, A_i^{-1}$, using the identity $(A \otimes G)^{-1}\,\mathrm{vec}(V) = \mathrm{vec}(G^{-1} V A^{-1})$ for symmetric factors.
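- A sketch of the per-block step (factor names $A_i$, $G_i$, shapes, and damping are illustrative assumptions following the standard K-FAC form, not something specific to these notes), plus a tiny check of the Kronecker-inverse identity:

```python
import jax.numpy as jnp

def kfac_block_step(grad_W, A, G, lr=1e-2, damping=1e-3):
    # grad_W: (out, in) gradient for one block's weights
    # A: (in, in) input-covariance factor, G: (out, out) output-gradient-covariance factor
    A_d = A + damping * jnp.eye(A.shape[0])
    G_d = G + damping * jnp.eye(G.shape[0])
    # (A ⊗ G)^{-1} vec(grad) corresponds to G^{-1} grad A^{-1}, so only the
    # small factors are ever inverted (here via two linear solves).
    Ginv_grad = jnp.linalg.solve(G_d, grad_W)            # G^{-1} ∇_W L
    step = jnp.linalg.solve(A_d, Ginv_grad.T).T          # (G^{-1} ∇_W L) A^{-1}
    return -lr * step

# tiny sanity check of the identity (column-major vec, symmetric factors)
din, dout = 3, 2
A = jnp.eye(din) + 0.1 * jnp.ones((din, din))
G = jnp.eye(dout) + 0.2 * jnp.ones((dout, dout))
V = jnp.arange(dout * din, dtype=jnp.float32).reshape(dout, din)
vec = lambda M: M.T.reshape(-1)                          # column-major vec
lhs = jnp.linalg.solve(jnp.kron(A, G), vec(V))
rhs = vec(jnp.linalg.solve(G, jnp.linalg.solve(A, V.T).T))
print(jnp.allclose(lhs, rhs, atol=1e-5))                 # True
```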
- Can we see this as compositional? TODO. Can we somehow get the step for each block 'locally' using custom gradients?
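- One possible mechanism for the 'locally' part (a sketch of the plumbing only, not an answer to the question; the preconditioner $P$ is just a fixed stand-in matrix): wrap a layer in jax.custom_vjp so its backward pass preconditions the incoming cotangent before the usual vector-Jacobian products.

```python
import jax
import jax.numpy as jnp

def make_precond_dense(P):
    # P: (out, out) matrix applied to the output cotangent on the backward pass
    # (a stand-in for, e.g., an inverse Fisher block for this layer)
    @jax.custom_vjp
    def precond_dense(W, a):
        return W @ a

    def fwd(W, a):
        return W @ a, (W, a)

    def bwd(res, g):
        W, a = res
        g_pre = P @ g                      # precondition the cotangent 'locally'
        return jnp.outer(g_pre, a), W.T @ g_pre

    precond_dense.defvjp(fwd, bwd)
    return precond_dense

# usage: gradients flowing through this layer are preconditioned in place
W = jnp.ones((2, 3))
a = jnp.array([1.0, 2.0, 3.0])
layer = make_precond_dense(0.5 * jnp.eye(2))
loss = lambda W_: jnp.sum(layer(W_, a) ** 2)
print(jax.grad(loss)(W))
```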