compositional natural gradient: Nonlinear Function
Created: September 26, 2020
Modified: October 30, 2020

compositional natural gradient

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • TO READ:
  • ORIGINAL WRITEUP:
    • The general idea is to implement something like pushforward natural gradient efficiently and automatically.
    • When optimizing a variational bound, I might want to precondition by the Fisher of the predictive distribution $p(x | \phi, \psi) = \int p_\phi(x | z)\, q_\psi(z)\, dz$, that is,
      $$\mathcal{F}_{\phi,\psi} = \mathbb{E}_x\left[\nabla^2 \log \int p_\phi(x | z)\, q_\psi(z)\, dz\right]$$
      or the expected Hessian of the conditional term in the ELBO:
      $$\mathbb{E}_{z, x}\left[\nabla_{\phi, \psi}^2 \log p_\phi(x | z_\psi(\epsilon))\right]$$
      where $z_\psi(\epsilon)$ is a reparameterized sample of $z$.
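      As a concrete sanity check (a toy linear-Gaussian case I'm assuming here, not from the original writeup): with $p_\phi(x | z) = \mathcal{N}(Wz, \sigma_x^2 I)$ and $q_\psi(z) = \mathcal{N}(\mu, D)$, $D$ diagonal, the predictive distribution is
      $$p(x | \phi, \psi) = \mathcal{N}\left(W\mu,\; W D W^\top + \sigma_x^2 I\right)$$
      and the $\mu$-block of the Fisher information is $W^\top (W D W^\top + \sigma_x^2 I)^{-1} W$, since only the mean depends on $\mu$.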
    • We are ultimately optimizing some loss $f_\theta(x)$ that depends on $\theta$ only through $\tilde{p}$. Suppose we're in a Monte Carlo setting where we compute a stochastic gradient by:
      • sampling $\tilde{z} \sim q_\theta(z) = z_\theta(\epsilon)$ (where $z_\theta$ reparameterizes $q_\theta$ in terms of standard randomness $\epsilon$)
      • computing $\tilde{f}(x) = -\log p(x | \tilde{z}) - \log p(\tilde{z}) + \log q_\theta(\tilde{z})$, a stochastic estimate of the negative ELBO.
      • Now the gradient with respect to $\theta$ is of the form
        $$\begin{aligned}\frac{d\tilde{f}}{d\theta} &= -\frac{d}{d\theta} \log p(x | \tilde{z}) - \frac{d}{d\theta} \log p(\tilde{z}) + \frac{d}{d\theta} \log q_\theta(\tilde{z})\\ &= -\frac{\partial}{\partial\tilde{z}}\frac{d\tilde{z}}{d\theta} \left(\log p(x | \tilde{z}) + \log p(\tilde{z}) \right) + \left(\frac{\partial}{\partial\theta} + \frac{\partial}{\partial\tilde{z}}\frac{d\tilde{z}}{d\theta}\right) \log q_\theta(\tilde{z}) \end{aligned}$$
        • (where each derivative operator is really lazy notation for a vector-jacobian product)
      • The 'at each stage' proposal is to replace each $\frac{d\tilde{z}}{d\theta}$ with the natural gradient relative to $q_\theta(z)$, and to replace each $\frac{\partial}{\partial\tilde{z}}$ with the natural gradient wrt $p(x | z)$.
        • Does this make any sense? In particular, is it equivalent to preconditioning by the Fisher of the pushforward distribution at the last step? (which is, by assumption, a reasonable thing to do).
        • Let's simplify by ignoring the $\frac{\partial}{\partial\theta}$ operator, i.e., we'll use the 'sticking the landing' estimator. Then the question is: how does $F_q^{-1} \frac{d z_\theta(\epsilon)}{d\theta} F_{p(x|z)}^{-1} \Delta$, the transformation of a gradient $\Delta$ through the two individual Fisher preconditioners sandwiched around the Jacobian of the reparameterized sample, compare to the 'pushforward' natural gradient $F_{\tilde{p}_\theta(x)}^{-1} \Delta$?
        • See the Gaussian case at https://colab.research.google.com/drive/1GskCqQE3wFXlbFUobYnhOxo23W-6AiMi#scrollTo=P6q8gGm25iiR . It turns out this can even be better than the pushforward: the pushforward puts the expectation over $z$ inside the log, so it tracks the curvature of the marginal log-likelihood, while the ELBO we are actually minimizing keeps that expectation outside the log. (A standalone JAX sketch of the 'at each stage' preconditioning follows these notes.)
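  • A minimal JAX sketch of the 'at each stage' idea for a toy linear-Gaussian model ($p(x|z) = \mathcal{N}(Wz, \sigma_x^2 I)$, $p(z) = \mathcal{N}(0, I)$, $q_\psi(z) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$). Everything below is my own illustration: the helper `precondition_grad`, the function `neg_elbo`, and the closed-form Fishers ($W^\top W / \sigma_x^2$ wrt $z$ for the likelihood; $\mathrm{diag}(1/\sigma^2)$ and $2$ per coordinate for the mean-field $q$) are assumptions for this toy case, not part of the original writeup.

```python
# A toy sketch (my own construction) of the 'at each stage' preconditioning:
# splice inverse-Fisher multiplications into the backward pass of a single-sample
# negative-ELBO estimate, using jax.custom_vjp.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm


@jax.custom_vjp
def precondition_grad(y, F_inv):
    """Identity on the forward pass; the backward pass left-multiplies the
    incoming cotangent by F_inv (an inverse-Fisher approximation)."""
    return y

def _precondition_fwd(y, F_inv):
    return y, F_inv

def _precondition_bwd(F_inv, g):
    # Cotangents for (y, F_inv); F_inv is treated as a constant.
    return F_inv @ g, jnp.zeros_like(F_inv)

precondition_grad.defvjp(_precondition_fwd, _precondition_bwd)


def neg_elbo(psi, W, x, eps, sigma_x=1.0):
    """Single-sample negative ELBO for p(x|z) = N(Wz, sigma_x^2 I), p(z) = N(0, I),
    q_psi(z) = N(mu, diag(sigma^2)), with the likelihood-Fisher splice at z~."""
    mu, log_sigma = psi
    sigma = jnp.exp(log_sigma)

    # Reparameterized sample z~ = mu + sigma * eps.
    z = mu + sigma * eps

    # Replace the cotangent at z~ (df/dz~) by a natural gradient wrt p(x|z):
    # for this Gaussian likelihood, the Fisher wrt z is W^T W / sigma_x^2.
    F_lik = W.T @ W / sigma_x**2
    z = precondition_grad(z, jnp.linalg.inv(F_lik))

    # 'Sticking the landing': drop the direct dependence of log q_psi(z~) on psi,
    # keeping only the pathwise dependence through z~.
    mu_sg = jax.lax.stop_gradient(mu)
    sigma_sg = jax.lax.stop_gradient(sigma)

    log_lik = norm.logpdf(x, W @ z, sigma_x).sum()
    log_prior = norm.logpdf(z, 0.0, 1.0).sum()
    log_q = norm.logpdf(z, mu_sg, sigma_sg).sum()
    return -(log_lik + log_prior - log_q)


# Usage: backprop as usual, then apply F_q^{-1} for the mean-field Gaussian q
# (F_mu = diag(1/sigma^2), F_{log sigma} = 2 per coordinate) to finish the sandwich.
d, D = 3, 5
W = jax.random.normal(jax.random.PRNGKey(0), (D, d))
x = jax.random.normal(jax.random.PRNGKey(1), (D,))
eps = jax.random.normal(jax.random.PRNGKey(2), (d,))
psi = (jnp.zeros(d), jnp.zeros(d))  # (mu, log_sigma)

g_mu, g_log_sigma = jax.grad(neg_elbo)(psi, W, x, eps)
sigma = jnp.exp(psi[1])
nat_step_mu = sigma**2 * g_mu           # F_mu^{-1} g_mu
nat_step_log_sigma = g_log_sigma / 2.0  # F_{log sigma}^{-1} g_{log sigma}
```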
  • COMPOSITIONAL HESSIANS: We might first ask how to compute or condition by the quantity with all expectations removed: the Hessian $\nabla^2 f(\phi, \psi) = \nabla^2 \log p_\phi(x | z_\psi(\epsilon))$.
    • In general we won't be able to work with the whole thing, but it may be tractable to approximate as block-diagonal.
    • The KFAC approach is, roughly, to approximate the Hessian as block-(tri)diagonal, where each block is a Kronecker product of factors, and these block matrices can be efficiently inverted. Computationally, we would compute the full gradient, use it to update the terms in our Hessian approximation, invert the blocks efficiently, and use the result to precondition the gradient. In the simplest block-diagonal case, each block takes a step $\mathbf{H}_{ii}^{-1} g_{i}$ (see the sketch at the end of these notes).
    • Can we see this as compositional? TODO. Can we somehow get the step for each block 'locally' using custom gradients?
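    • A minimal sketch of the block-diagonal recipe above. This is a strawman, not KFAC itself: it uses a running empirical-Fisher estimate (an EMA of per-block gradient outer products) as each block's curvature instead of KFAC's Kronecker factorization, and the helper names `update_curvature` and `block_diag_precondition` are hypothetical.

```python
# Strawman block-diagonal preconditioning: maintain a per-block curvature
# estimate, invert each block with damping, and take the step H_ii^{-1} g_i.
import jax.numpy as jnp


def update_curvature(curvature, grads, decay=0.95):
    """EMA of per-block gradient outer products, a crude empirical-Fisher proxy."""
    return {name: decay * curvature[name] + (1.0 - decay) * jnp.outer(g, g)
            for name, g in grads.items()}


def block_diag_precondition(grads, curvature, damping=1e-3):
    """Apply H_ii^{-1} g_i per block, with Tikhonov damping for invertibility."""
    steps = {}
    for name, g in grads.items():
        H = curvature[name] + damping * jnp.eye(g.shape[0])
        steps[name] = jnp.linalg.solve(H, g)
    return steps


# Usage: inside a training loop, with `grads` a dict of flat per-block gradient
# vectors (e.g. one entry per layer):
#   curvature = update_curvature(curvature, grads)
#   steps = block_diag_precondition(grads, curvature)
#   params = {k: params[k] - lr * steps[k] for k in params}
```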