compositional natural gradient: Nonlinear Function
Created: September 26, 2020
Modified: October 30, 2020

compositional natural gradient

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • TO READ:
  • ORIGINAL WRITEUP:
    • The general idea is to implement something like pushforward natural gradient efficiently and automatically.
    • When optimizing a variational bound, I might want to precondition by the Fisher of the predictive distribution $p(x | \phi, \psi) = \int p_\phi(x | z)\, q_\psi(z)\, dz$, that is,
      $$\mathcal{F}_{\phi,\psi} = \mathbb{E}_x\left[\nabla^2 \log \int p_\phi(x | z)\, q_\psi(z)\, dz\right]$$
      or the expected Hessian of the conditional term in the ELBO:
      $$\mathbb{E}_{z, x}\left[\nabla_{\phi, \psi}^2 \log p_\phi(x | z_\psi(\epsilon))\right]$$
      where $z_\psi(\epsilon)$ is a reparameterized sample of $z$.
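      As a concrete sanity check (a toy linear-Gaussian case I'm assuming here, not from the original writeup): with $p_\phi(x | z) = \mathcal{N}(Wz, \sigma_x^2 I)$ and $q_\psi(z) = \mathcal{N}(\mu, D)$, $D$ diagonal, the predictive distribution is
      $$p(x | \phi, \psi) = \mathcal{N}\left(W\mu,\; W D W^\top + \sigma_x^2 I\right)$$
      and the $\mu$-block of the Fisher information is $W^\top (W D W^\top + \sigma_x^2 I)^{-1} W$, since only the mean depends on $\mu$.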
    • We are ultimately optimizing some loss $f_\theta(x)$ that depends on $\theta$ only through $\tilde{p}$. Suppose we're in a Monte Carlo setting where we compute a stochastic gradient by:
      • sampling $\tilde{z} \sim q_\theta(z) = z_\theta(\epsilon)$ (where $z_\theta$ reparameterizes $q_\theta$ in terms of standard randomness $\epsilon$)
      • computing $\tilde{f}(x) = -\log p(x | \tilde{z}) - \log p(\tilde{z}) + \log q_\theta(\tilde{z})$, a stochastic estimate of the negative ELBO.
      • Now the gradient with respect to $\theta$ is of the form
        $$\begin{aligned}\frac{d\tilde{f}}{d\theta} &= -\frac{d}{d\theta} \log p(x | \tilde{z}) - \frac{d}{d\theta} \log p(\tilde{z}) + \frac{d}{d\theta} \log q_\theta(\tilde{z})\\ &= -\frac{\partial}{\partial\tilde{z}}\frac{d\tilde{z}}{d\theta} \left(\log p(x | \tilde{z}) + \log p(\tilde{z}) \right) + \left(\frac{\partial}{\partial\theta} + \frac{\partial}{\partial\tilde{z}}\frac{d\tilde{z}}{d\theta}\right) \log q_\theta(\tilde{z}) \end{aligned}$$
        • (where each derivative operator is really lazy notation for a vector-jacobian product)
      • The 'at each stage' proposal is to replace each $\frac{d\tilde{z}}{d\theta}$ with the natural gradient relative to $q_\theta(z)$, and to replace each $\frac{\partial}{\partial\tilde{z}}$ with the natural gradient wrt $p(x | z)$.
        • Does this make any sense? In particular, is it equivalent to preconditioning by the Fisher of the pushforward distribution at the last step? (which is, by assumption, a reasonable thing to do).
        • Let's simplify by ignoring the $\frac{\partial}{\partial\theta}$ operator, i.e., we'll use the 'sticking the landing' estimator. Then the question is: how does $F_q^{-1} \frac{d z_\theta(\epsilon)}{d\theta} F_{p(x|z)}^{-1} \Delta$, the transformation of a gradient $\Delta$ through the two individual Fisher preconditioners sandwiched around the Jacobian of the reparameterized sample, compare to the 'pushforward' natural gradient $F_{\tilde{p}_\theta(x)}^{-1} \Delta$?
        • See the Gaussian case at https://colab.research.google.com/drive/1GskCqQE3wFXlbFUobYnhOxo23W-6AiMi#scrollTo=P6q8gGm25iiR . It turns out this can even be better than the pushforward: the pushforward puts the expectation over $z$ inside the log, so it tracks the curvature of the marginal log-likelihood, while the ELBO we are actually minimizing keeps that expectation outside the log. (A standalone JAX sketch of the 'at each stage' preconditioning follows these notes.)
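  • A minimal JAX sketch of the 'at each stage' idea for a toy linear-Gaussian model ($p(x|z) = \mathcal{N}(Wz, \sigma_x^2 I)$, $p(z) = \mathcal{N}(0, I)$, $q_\psi(z) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$). Everything below is my own illustration: the helper `precondition_grad`, the function `neg_elbo`, and the closed-form Fishers ($W^\top W / \sigma_x^2$ wrt $z$ for the likelihood; $\mathrm{diag}(1/\sigma^2)$ and $2$ per coordinate for the mean-field $q$) are assumptions for this toy case, not part of the original writeup.

```python
# A toy sketch (my own construction) of the 'at each stage' preconditioning:
# splice inverse-Fisher multiplications into the backward pass of a single-sample
# negative-ELBO estimate, using jax.custom_vjp.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm


@jax.custom_vjp
def precondition_grad(y, F_inv):
    """Identity on the forward pass; the backward pass left-multiplies the
    incoming cotangent by F_inv (an inverse-Fisher approximation)."""
    return y

def _precondition_fwd(y, F_inv):
    return y, F_inv

def _precondition_bwd(F_inv, g):
    # Cotangents for (y, F_inv); F_inv is treated as a constant.
    return F_inv @ g, jnp.zeros_like(F_inv)

precondition_grad.defvjp(_precondition_fwd, _precondition_bwd)


def neg_elbo(psi, W, x, eps, sigma_x=1.0):
    """Single-sample negative ELBO for p(x|z) = N(Wz, sigma_x^2 I), p(z) = N(0, I),
    q_psi(z) = N(mu, diag(sigma^2)), with the likelihood-Fisher splice at z~."""
    mu, log_sigma = psi
    sigma = jnp.exp(log_sigma)

    # Reparameterized sample z~ = mu + sigma * eps.
    z = mu + sigma * eps

    # Replace the cotangent at z~ (df/dz~) by a natural gradient wrt p(x|z):
    # for this Gaussian likelihood, the Fisher wrt z is W^T W / sigma_x^2.
    F_lik = W.T @ W / sigma_x**2
    z = precondition_grad(z, jnp.linalg.inv(F_lik))

    # 'Sticking the landing': drop the direct dependence of log q_psi(z~) on psi,
    # keeping only the pathwise dependence through z~.
    mu_sg = jax.lax.stop_gradient(mu)
    sigma_sg = jax.lax.stop_gradient(sigma)

    log_lik = norm.logpdf(x, W @ z, sigma_x).sum()
    log_prior = norm.logpdf(z, 0.0, 1.0).sum()
    log_q = norm.logpdf(z, mu_sg, sigma_sg).sum()
    return -(log_lik + log_prior - log_q)


# Usage: backprop as usual, then apply F_q^{-1} for the mean-field Gaussian q
# (F_mu = diag(1/sigma^2), F_{log sigma} = 2 per coordinate) to finish the sandwich.
d, D = 3, 5
W = jax.random.normal(jax.random.PRNGKey(0), (D, d))
x = jax.random.normal(jax.random.PRNGKey(1), (D,))
eps = jax.random.normal(jax.random.PRNGKey(2), (d,))
psi = (jnp.zeros(d), jnp.zeros(d))  # (mu, log_sigma)

g_mu, g_log_sigma = jax.grad(neg_elbo)(psi, W, x, eps)
sigma = jnp.exp(psi[1])
nat_step_mu = sigma**2 * g_mu           # F_mu^{-1} g_mu
nat_step_log_sigma = g_log_sigma / 2.0  # F_{log sigma}^{-1} g_{log sigma}
```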
  • COMPOSITIONAL HESSIANS: We might first ask how to compute or condition by the quantity with all expectations removed: the Hessian $\nabla^2 f(\phi, \psi) = \nabla^2 \log p_\phi(x | z_\psi(\epsilon))$.
    • In general we won't be able to work with the whole thing, but it may be tractable to approximate as block-diagonal.
    • The KFAC approach is, roughly, to approximate the Hessian as block-(tri)diagonal, where each block is a Kronecker product of factors, and these block matrices can be efficiently inverted. Computationally, we would compute the full gradient, use it to update the terms in our Hessian approximation, invert the blocks efficiently, and use the result to precondition the gradient. In the simplest block-diagonal case, each block takes a step $\mathbf{H}_{ii}^{-1} g_{i}$ (see the sketch at the end of these notes).
    • Can we see this as compositional? TODO. Can we somehow get the step for each block 'locally' using custom gradients?
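    • A minimal sketch of the block-diagonal recipe above. This is a strawman, not KFAC itself: it uses a running empirical-Fisher estimate (an EMA of per-block gradient outer products) as each block's curvature instead of KFAC's Kronecker factorization, and the helper names `update_curvature` and `block_diag_precondition` are hypothetical.

```python
# Strawman block-diagonal preconditioning: maintain a per-block curvature
# estimate, invert each block with damping, and take the step H_ii^{-1} g_i.
import jax.numpy as jnp


def update_curvature(curvature, grads, decay=0.95):
    """EMA of per-block gradient outer products, a crude empirical-Fisher proxy."""
    return {name: decay * curvature[name] + (1.0 - decay) * jnp.outer(g, g)
            for name, g in grads.items()}


def block_diag_precondition(grads, curvature, damping=1e-3):
    """Apply H_ii^{-1} g_i per block, with Tikhonov damping for invertibility."""
    steps = {}
    for name, g in grads.items():
        H = curvature[name] + damping * jnp.eye(g.shape[0])
        steps[name] = jnp.linalg.solve(H, g)
    return steps


# Usage: inside a training loop, with `grads` a dict of flat per-block gradient
# vectors (e.g. one entry per layer):
#   curvature = update_curvature(curvature, grads)
#   steps = block_diag_precondition(grads, curvature)
#   params = {k: params[k] - lr * steps[k] for k in params}
```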