Modified: March 14, 2022
mode-covering variational inference is incoherent
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector. I have a strong opinion weakly held that doesn't seem to be widely shared in the approximate Bayesian inference community: reverse (or 'mode-seeking') KL divergence is the right objective for variational inference, and other objectives are essentially dead ends.
The general reason to consider other objectives, such as forward ('mode-covering') KL, Rényi divergences (including $\chi^2$ VI), various $\alpha$- and $f$-divergences, etc., is that they try to cover the entire posterior rather than just a single mode. If given the choice, they will overestimate uncertainty rather than underestimate it. In many applications (including AI safety) we might plausibly prefer an underconfident system to an overconfident system.
A roughly equivalent fact is that mode-seeking objectives produce evidence lower bounds, because the approximate posterior considers only a subset of explanations for the data, while mode-covering objectives produce upper bounds, because the approximate posterior tends to cover all possible explanations for the data and also some impossible ones. Having both upper and lower bounds allows us to sandwich the true evidence, which is attractive from a mathy perspective.
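To make the sandwich explicit (these are standard identities; $p(z \mid x)$ is the true posterior, $q(z)$ the approximation, $p(x,z)$ the joint):

$$\log p(x) \;=\; \underbrace{\mathbb{E}_{q(z)}\!\left[\log \tfrac{p(x,z)}{q(z)}\right]}_{\text{ELBO}} \;+\; \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big) \;\;\ge\;\; \text{ELBO}$$

$$\log p(x) \;=\; \underbrace{\mathbb{E}_{p(z \mid x)}\!\left[\log \tfrac{p(x,z)}{q(z)}\right]}_{\text{EUBO}} \;-\; \mathrm{KL}\big(p(z \mid x)\,\|\,q(z)\big) \;\;\le\;\; \text{EUBO}$$

Both gaps are KL divergences, so driving the reverse KL to zero pushes the lower bound up to $\log p(x)$, while driving the forward KL to zero pulls the upper bound down to it.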
The problems with mode-covering objectives are twofold: first, they can't be computed or optimized reliably in general; second, even if they could be, the resulting 'posteriors' would not be useful for real-world decision making.
Both problems stem fundamentally from the nature of reasoning about uncertainty in high-dimensional spaces (i.e., the real world): there are exponentially many ways the world could be, and we don't have the computational resources to consider all of them. The success of methods such as Monte Carlo tree search reflects the importance of selectively attending to high-probability states in order to get anything done; computation is important and we can't afford to waste it on low-probability states.
In high dimensions, a mode-covering $q$ will put the vast bulk of its mass on hypotheses with near-zero posterior probability. Any finite number of samples we draw will be a very poor representation of the true posterior. Not only will our samples include a lot of low-probability hypotheses, they might still miss important modes because, although covered by $q$, these modes represent a relatively small fraction of its mass. This makes the mode-covering $q$ of limited use for real-world decision making.
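A toy illustration of this effect (my own construction, not from any paper): take the posterior to be $p = \mathcal{N}(0, I_d)$ and the mode-covering approximation to be an overdispersed $q = \mathcal{N}(0, \sigma^2 I_d)$ with $\sigma > 1$. As $d$ grows, essentially every sample from $q$ lands in a region of vanishing posterior density:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5  # scale of the overdispersed, "mode-covering" q = N(0, sigma^2 I_d)

def log_p(z):
    """Log density of the target posterior p = N(0, I_d)."""
    d = z.shape[-1]
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

for d in [1, 10, 100, 1000]:
    z = sigma * rng.standard_normal((10_000, d))  # samples from q
    # A typical draw from p itself has log-density about -d/2 * (1 + log 2*pi);
    # the gap below measures how "impossible" q's samples look under p.
    typical = -0.5 * d * (1 + np.log(2 * np.pi))
    gap = typical - log_p(z)
    print(f"d={d:5d}  mean log-density gap under p: {gap.mean():8.1f} nats")
```

The gap grows linearly with dimension (it is $\tfrac{d}{2}(\sigma^2 - 1)$ in this toy case), so each $q$-sample is astronomically less probable under the posterior than a typical posterior draw, even though $q$ 'covers' $p$ in the sense of having heavier tails everywhere.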
Even from the perspective of pure stochastic optimization, all divergences are (in general) approximated by sampling from some proposal distribution, which defines the 'flashlight beam' that we shine onto the posterior landscape to direct our attention towards a subset of hypotheses. Any variational objective is necessarily based on how much we like what we see under that beam. Since by assumption we can't sample from the true posterior $p(z \mid x)$, the only viable proposal distribution is our approximate posterior $q$. The problem is that mode-covering divergences depend very strongly on evaluating the parts of the space that we can't see using the current $q$. A divergence defined as an expectation under the full posterior 'wants' to be evaluated in that light, but that's not possible.
- But under a mode-covering posterior, we will shine a light broadly over the whole space. Isn't that fine?
- No, because in high-dimensional spaces most of the photons (samples) will go towards impossible hypotheses. The light does in theory hit the good hypotheses, but any finite-sample approximation of it won't (see the sketch below).
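One way to quantify "the finite-sample approximation won't hit them" (again my own toy setup, continuing the Gaussian example above): importance-weight the $q$-samples by $w = p/q$ and look at the effective sample size $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$. Even for this mild mismatch, the ESS collapses to a handful of samples as dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5   # overdispersed proposal q = N(0, sigma^2 I_d); target p = N(0, I_d)
n = 10_000

for d in [1, 10, 50, 100]:
    z = sigma * rng.standard_normal((n, d))
    sq = np.sum(z**2, axis=-1)
    # Log importance weight log p(z) - log q(z); the 2*pi normalizers cancel.
    log_w = -0.5 * sq * (1.0 - 1.0 / sigma**2) + d * np.log(sigma)
    w = np.exp(log_w - log_w.max())   # subtract max for numerical stability
    ess = w.sum()**2 / np.sum(w**2)   # effective sample size of the weighted set
    print(f"d={d:4d}  ESS out of {n}: {ess:8.1f}")
```

Nearly all the weight concentrates on the few samples that happen to land near the posterior's typical set, so the broadly aimed beam delivers almost no usable photons.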
Potential counterexample: generative models
On the other hand, maximum likelihood training of generative models is equivalent to forward KL, and after a few years of refinement it is starting to work well: diffusion models are matching the sample quality of GANs while covering much more of the data distribution. Does this disprove my argument? We might expect any form of model misspecification to make these models exponentially likely to sample things that are not part of the data distribution, but in fact, they seem to fit quite well.
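For reference, the equivalence being invoked (writing $q_\theta$ for the model being trained):

$$\arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log q_\theta(x)\big] \;=\; \arg\min_\theta \; \mathrm{KL}\big(p_{\text{data}} \,\|\, q_\theta\big),$$

since the two objectives differ only by the $\theta$-independent entropy of $p_{\text{data}}$.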
The difference is that because these models are trained on empirical data, the attention of the training process is kept tightly focused on the actual data distribution. The results show that our models really can represent complex high-dimensional distributions, but we need to train them using samples from said distributions. Sampling from an untrained model (as you'd do in a typical variational training procedure) is still exponentially unlikely to attend to anything useful. Note that importance weighting doesn't fix this, since on its own it can only reweight existing samples.
This implies that we might be able to fit mode-covering surrogate posteriors if only we could generate high-quality posterior samples to train on. This is kind of circular: presumably the reason we're doing VI is that we can't already generate such samples. But it could work to use MCMC steps to direct our attention incrementally towards the true posterior, since gradient-based methods such as HMC can avoid the curse of dimensionality. This is essentially the argument for why resample-move particle filtering can avoid degeneracy.
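A cartoon of that loop (entirely my own sketch, in the spirit of Markovian score climbing and resample-move ideas; the target `log_joint` and all settings are made up for illustration, and a random-walk Metropolis step stands in for HMC to keep it short): sample from the current $q$, nudge the samples toward the posterior with a few MCMC steps, then refit $q$ by maximum likelihood (a forward-KL-style update) on the nudged samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(z):
    """Assumed unnormalized log posterior: a correlated 2-D Gaussian stand-in."""
    return -0.5 * (z[..., 0]**2 + 4.0 * (z[..., 1] - 0.5 * z[..., 0])**2)

def metropolis_step(z, step=0.5):
    """One random-walk Metropolis step targeting log_joint (use HMC in practice)."""
    prop = z + step * rng.standard_normal(z.shape)
    accept = np.log(rng.random(z.shape[0])) < log_joint(prop) - log_joint(z)
    return np.where(accept[:, None], prop, z)

# q is a diagonal Gaussian with parameters (mu, sigma); start deliberately broad.
mu, sigma = np.zeros(2), np.full(2, 3.0)
for it in range(200):
    z = mu + sigma * rng.standard_normal((256, 2))   # sample from the current q
    for _ in range(5):                               # MCMC moves toward the posterior
        z = metropolis_step(z)
    # Forward-KL-style update: fit q by maximum likelihood on the moved samples.
    # For a diagonal Gaussian this reduces to moment matching.
    mu, sigma = z.mean(axis=0), z.std(axis=0)

print("fitted mu:", mu, "fitted sigma:", sigma)
```

The training signal here comes from the MCMC moves rather than from scoring $q$'s own samples against the full posterior, which is what sidesteps the flashlight problem above.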
TODO: where do wake-sleep methods fit into this?
Thoughts
Q: what about generative flow networks?
None of this is to deny that overdispersed posteriors and upper-bound minimization might be useful in 'tame' applications with simple probabilistic models, where the posterior is low-dimensional and/or unimodal. But if VI is to be useful for general intelligence (and I'm not sure that it will be, again because computation is important, which argues against explicit models of uncertainty), then seeking overdispersed posteriors is not going to be tractable.