Bayesian learning rule: Nonlinear Function
Created: September 07, 2020
Modified: July 18, 2021

Bayesian learning rule

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • See https://emtiyaz.github.io/papers/learning_from_bayes.pdf
  • Suppose we have a learning problem:
    $$\min_{q(w)\in\mathcal{Q}} \mathcal{L}(q) = E_{q(w)}\left[\sum_{i=1}^N\ell(y_i, f_w(x_i))\right] + KL\left[q(w) \| p(w)\right]$$
  • For some choice of exponential-family approximating family $\mathcal{Q}$, the Bayesian learning rule is:
    $$\lambda_{t + 1} = (1 - \rho_t)\lambda_t - \rho_t \nabla_\mu E_{q_t}\left[\bar{\ell}(w)\right]$$
    where:
    • $\lambda_t$ are the natural parameters of $q_t(w) \in \mathcal{Q}$
    • $\mu$ are its expectation parameters
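Here's a minimal numpy sketch of this update for a toy Bayesian linear-regression model with a diagonal-Gaussian $q(w)$, taking $\bar{\ell}(w) = \sum_i \ell(y_i, f_w(x_i)) - \log p(w)$ (losses plus negative log prior, which is what turns the entropy term into the $(1-\rho_t)\lambda_t$ part). The model, noise variance, and step size are all made up for illustration, and the expectation-parameter gradients are in closed form only because the toy model is conjugate; in general you'd estimate them with Monte Carlo via the Bonnet/Price identities.

```python
# Minimal sketch of the Bayesian learning rule for a toy Bayesian linear
# regression with a diagonal-Gaussian q(w) = N(m, diag(s2)).
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma2 = 50, 3, 0.25
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=N)

# Natural parameters of the diagonal Gaussian:
#   lam1 = m / s2 (precision-weighted mean), lam2 = -1 / (2 s2).
# Initialize q at the prior N(0, I).
lam1, lam2 = np.zeros(D), -0.5 * np.ones(D)
rho = 0.1
col_sq = (X ** 2).sum(axis=0)  # ||X[:, j]||^2, reused in the closed-form gradients

for _ in range(200):
    # Recover (m, s2) from the natural parameters.
    s2 = -0.5 / lam2
    m = lam1 * s2

    # Gradients of E_q[lbar] w.r.t. (m, s2), closed-form for this toy model,
    # where lbar(w) = sum_i (y_i - x_i @ w)^2 / (2 sigma2) + w @ w / 2.
    g_m = X.T @ (X @ m - y) / sigma2 + m
    g_s2 = 0.5 * col_sq / sigma2 + 0.5

    # Chain rule to the expectation parameters mu1 = m, mu2 = m^2 + s2:
    #   d/d_mu2 = d/d_s2,   d/d_mu1 = d/d_m - 2 m * d/d_s2.
    g_mu2 = g_s2
    g_mu1 = g_m - 2.0 * m * g_s2

    # Bayesian learning rule: lam <- (1 - rho) lam - rho * grad_mu E_q[lbar].
    lam1 = (1.0 - rho) * lam1 - rho * g_mu1
    lam2 = (1.0 - rho) * lam2 - rho * g_mu2

# At the fixed point the per-coordinate precision -2 lam2 matches
# diag(X^T X / sigma2 + I), i.e. the mean-field VI solution for this model.
print("variational mean:", lam1 * (-0.5 / lam2))
print("variational precision:", -2.0 * lam2)
print("diag(X^T X / sigma2 + I):", col_sq / sigma2 + 1.0)
```

The precision update here is doing the job of the per-coordinate scaling that Adam would otherwise track as separate state, which is the point about Gaussian VI below.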
  • What are the implications of this rule for TFP? Should it affect how we do variational inference?
    • Suppose we had a set of distributions in natural parameterization. We could literally use the autodiff trick from the Bayesian learning rule to get natural gradients with respect to those natural parameters: the natural gradient with respect to $\lambda$ is just the ordinary gradient with respect to the expectation parameters $\mu$.
    • What does this get us?
      • Well, suppose we do Gaussian VI. Then this gives us something like Adam that doesn't have to (or doesn't get to) fit the scale parameter separately from the loc parameter. I guess this is a memory savings, maybe it works better? I could also imagine that the scale-of-scale parameter is a useful auxiliary variable and that getting rid of it hurts optimization. I think it's an empirical question how much this is worth.
      • But now suppose we do something like ASVI on an LDA model that has lots of exponential-family distributions, and our surrogate is a big joint exponential family. This must be useful. We'd effectively be allowing the optimizer to use the joint structure of our model. No built-in optimizer can do that. This seems cool!
        • Does this actually work? Say I have a Gaussian chain. Let's say we also take a Gaussian chain as our posterior class. (I believe that ASVI lets you do this). Then our sampling distribution and natural gradients at each node depend on our conditional belief about that node, not just a marginal belief.
    • What are the perils of using natural parameters?
      • Converting between parameterizations can be expensive and numerically unstable.
      • It's not clear how to enforce parameter constraints. There are often constraints on the natural parameter space; e.g., the $\alpha$ and $\beta$ of the Beta distribution must be positive. And we're not allowed to use a TransformedVariable to enforce this, because it screws up the geometry of the natural gradients. So to apply the rule, we'd need a projection step or a line search (see the sketch after this list). We could do that---we have code for line search, and/or it wouldn't be hard to add a constraint or projection function to every exponential-family distribution we annotate. It wouldn't be 'just SGD' any more, but maybe that's good---the modularity with non-SGD optimizers allowed by the generic optimization framework is just a source of confusion.
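As a sketch of what a projection step might look like (nothing TFP provides today, and only the crudest possible projection), here's a Bayesian-learning-rule step on Beta natural parameters followed by a clip back into the valid region. It assumes the parameterization $\eta = (\alpha - 1, \beta - 1)$ with sufficient statistics $(\log x, \log(1-x))$, and `natural_gradient` is just a stand-in for whatever estimator produces $\nabla_\mu E_{q_t}[\bar{\ell}(w)]$.

```python
import numpy as np

def project_beta_natural(eta, eps=1e-3):
    """Clip onto the valid region {eta_i > -1} (i.e. alpha, beta > 0) with a margin."""
    return np.maximum(eta, -1.0 + eps)

def blr_step_beta(eta, natural_gradient, rho=0.05):
    """One Bayesian-learning-rule step, then project back to a valid Beta."""
    eta_new = (1.0 - rho) * eta - rho * natural_gradient
    return project_beta_natural(eta_new)

# Example: a step whose unconstrained update leaves the valid region.
eta = np.array([0.5, -0.9])   # natural parameters of Beta(alpha=1.5, beta=0.1)
g = np.array([-2.0, 3.0])     # hypothetical grad_mu E_q[lbar] at this q
print(blr_step_beta(eta, g))  # -> [0.575, -0.999]: second coordinate was clipped
```

A real implementation would probably prefer a line search on $\rho_t$ or a KL-style projection over per-coordinate clipping, but the shape of the problem is the same: the raw natural-parameter update can leave the valid region, and something has to bring it back.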
  • Is the Bayesian learning rule a good idea? It seems dangerous to conduct your optimization according to your current posterior estimate when you know your current posterior estimate is probably bad. But maybe not that dangerous, if you start at the prior? Then you 'shrink' towards the posterior, which isn't crazy, though also you're doing stochastic optimization with discrete steps and projections and god knows what will happen.
  • It's hard to get too excited about conjugate VI when most of the optimizations we really care about are big and deep. Using 'natural gradient' structure is great, but in most cases it's not clear there's a much better structure to impose than Adam already uses.
  • I could maybe see this as a nice way to implement something like KFAC. But how much do we really care?