natural gradient: Nonlinear Function
Created: August 31, 2020
Modified: July 06, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We don't typically think of it this way, but you can derive a gradient descent step as finding the point that minimizes a linearized objective within a Euclidean ball around the current point:

\begin{align*}
&\min \tilde{f}(x)\\
&\text{s.t. }\|x - x_\text{old}\|^2 \le \epsilon
\end{align*}

where

\tilde{f}(x) = f(x_\text{old}) + (x - x_\text{old})^T \nabla f(x_\text{old})

is just the first-order Taylor expansion of the objective.
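
To see that this recovers the usual update, note that the linearized objective decreases fastest along the negative gradient, so the minimizer sits on the boundary of the ball in that direction:

\begin{align*}
x_\text{new} &= x_\text{old} - \sqrt{\epsilon}\,\frac{\nabla f(x_\text{old})}{\|\nabla f(x_\text{old})\|}\\
&= x_\text{old} - \eta \nabla f(x_\text{old})
\end{align*}

where the step size η absorbs the constraint radius and the gradient norm; this is exactly a gradient descent step.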

Sometimes we're actually optimizing over the parameters of a probability distribution (including, for example, a neural net that outputs a distribution). Then we might not care so much about Euclidean geometry in parameter space. We might prefer to take the step that minimizes the objective subject to a constraint on the KL divergence between the old and new distributions:

\begin{align*}
&\min \tilde{f}(\theta)\\
&\text{s.t. }\mathcal{D}_\text{KL}\left(p_{\theta_\text{old}} \| p_\theta\right) \le \epsilon
\end{align*}

TODO other facts:

  • the quadratic (second-order) approximation of the KL is symmetric in its two arguments and leads to Fisher preconditioning (sketched below)
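
Filling that in (this is the standard expansion, spelled out for my own reference): the KL and its first-order term both vanish at θ = θ_old, so to second order it is a quadratic form in the Fisher information matrix, and the constrained step becomes the plain gradient step preconditioned by the inverse Fisher:

\begin{align*}
\mathcal{D}_\text{KL}\left(p_{\theta_\text{old}} \| p_\theta\right) &\approx \tfrac{1}{2}(\theta - \theta_\text{old})^T F(\theta_\text{old})\, (\theta - \theta_\text{old})\\
F(\theta_\text{old}) &= \mathbb{E}_{x \sim p_{\theta_\text{old}}}\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^T\right]\Big|_{\theta = \theta_\text{old}}\\
\theta_\text{new} &= \theta_\text{old} - \eta\, F(\theta_\text{old})^{-1} \nabla f(\theta_\text{old})
\end{align*}

Expanding the KL in the reverse direction gives the same quadratic form, which is the symmetry noted above.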

old notes (TODO clean up or delete)

  • Emti Khan has a nice writeup of his 'Bayesian learning rule': https://emtiyaz.github.io/papers/learning_from_bayes.pdf
  • For an exponential-family Q, the rule is just natural gradient wrt Q.
  • I've used a similar observation to build loss functions that do 'automatic natural gradient'. These loss functions use 'stop_gradient', so they're not really loss functions in the conventional sense, but computation graphs annotated with how gradients should flow backward.
  • Does this mean that I could get interesting algorithms this way? Suppose I have a Gaussian surrogate. Emti's paper shows that in general, his rule with a Gaussian Q reduces to a version of Newton's method where the Hessian is estimated online.
  • So I think using an MVN 'automatic natural q' as a VI surrogate would implement something like second-order optimization. Of course you pay the cubic cost to convert between natural and expectation parameters, so this isn't unreasonable.
    • I think this might provide an easy path to implementing Vadam and similar algorithms with TFP. This might be really nice for variational inference in TFP. Instead of optimizing both parameters and
  • Reference to using mixture-of-exponential-family distributions and getting out Bayesian ensemble methods. Seems relevant to what NameRedacted is doing? https://arxiv.org/abs/1906.02914
  • I still want to understand section 2.2 on the connection that 'natural gradient in λ is mirror descent in μ' and vice versa (a small numerical check of the underlying identity is sketched below).
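
As a sanity check on that connection, here's a minimal numerical sketch (my own toy example, not from the paper; the helper names and the loss are made up) for a 1-D Gaussian written in exponential-family form. The Fisher-preconditioned gradient with respect to the natural parameters λ should match the plain gradient with respect to the expectation parameters μ:

```python
# Toy check of "natural gradient in lambda = plain gradient in mu"
# for a 1-D Gaussian N(m, s^2) in exponential-family form.
import jax
import jax.numpy as jnp

def log_partition(lam):
    """A(lambda) with lam = (m / s^2, -1 / (2 s^2))."""
    lam1, lam2 = lam
    return -lam1**2 / (4.0 * lam2) - 0.5 * jnp.log(-2.0 * lam2)

# Expectation parameters mu = grad A(lambda) = (E[x], E[x^2]).
mean_params = jax.grad(log_partition)
# Fisher information wrt lambda is the Hessian of the log-partition function.
fisher = jax.hessian(log_partition)

def loss_in_mu(mu):
    """An arbitrary smooth loss written in expectation parameters."""
    return jnp.sum(jnp.sin(mu) + 0.5 * mu**2)

def loss_in_lambda(lam):
    """The same loss, reparameterized through mu(lambda)."""
    return loss_in_mu(mean_params(lam))

lam = jnp.array([1.0, -0.5])  # corresponds to m = 1, s^2 = 1

grad_lam = jax.grad(loss_in_lambda)(lam)
nat_grad = jnp.linalg.solve(fisher(lam), grad_lam)  # F(lam)^{-1} grad_lambda
grad_mu = jax.grad(loss_in_mu)(mean_params(lam))    # plain gradient in mu

print(nat_grad)  # the two printed vectors should agree up to float error
print(grad_mu)
```

This only checks the parameterization identity (natural gradient in λ equals the gradient in μ), not the full mirror-descent statement in section 2.2.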