Created: January 27, 2022
Modified: February 10, 2022
ensemble
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

- Often we think of ensembles in the context of supervised learning: we have some algorithm that learns X -> y mappings, and by running it multiple times we learn multiple X -> y mappings. Averaging or otherwise aggregating the predicted y's then tends to work better than using any single mapping.
- Special cases of this are bagging (the algorithm trains a model on a bootstrap resample of the data), boosting (the algorithm trains on weighted data, and you run it multiple times in sequence, upweighting the examples the current ensemble gets wrong), and stacking (each training run uses a different model type, and you then separately learn the optimal weight to give each model's predictions). A quick sketch of bagging is below.
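As a concrete illustration (my sketch, not from the original note): bagging trains the same base learner on bootstrap resamples and averages the predictions. Using scikit-learn's DecisionTreeRegressor as the base learner is just an assumption for the example.

```python
# Minimal bagging sketch: train one base learner per bootstrap resample,
# then aggregate predictions by averaging.
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # example base learner (assumption)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

def bagged_predict(X_train, y_train, X_test, n_models=50):
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap resample
        model = DecisionTreeRegressor(max_depth=3).fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)  # ensemble prediction = average over models

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_hat = bagged_predict(X, y, X_test)
```

The averaged predictions are smoother than any single tree's, which is the variance-reduction point made below.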
- But the idea is more general and works for unsupervised learning too. One can ensemble general density estimators, e.g., by taking a mixture, or a product of experts.
- Ensembling is good for the same reason that diversification is: it reduces variance. If there are many models compatible with the data, and your training algorithm effectively chooses one at random, then you've introduced variance into your predictions. Using all compatible models averages away that variance (for K predictors with uncorrelated errors of variance sigma^2, the average has error variance sigma^2 / K; correlated errors shrink the benefit but rarely eliminate it).
- Ensembling also enlarges the hypothesis space. A mixture of decision stumps is much more flexible than a single one; a mixture of Gaussians is much more flexible than a single Gaussian.
- Interestingly, while a mixture of Gaussians can approximate essentially any density (given enough components), a product of Gaussians is just another Gaussian. So we'd like to say that mixtures are 'more powerful'. But this ignores the unique advantages of product models (as Hinton puts it, a mixture of experts is at most as sharp as its sharpest expert, while a product of experts can be much sharper than any individual expert).
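To make the contrast concrete (the standard one-dimensional calculation, not in the original note): completing the square shows that a product of Gaussians is again Gaussian, with precision equal to the sum of the precisions,

$$
\mathcal{N}(x;\mu_1,\sigma_1^2)\,\mathcal{N}(x;\mu_2,\sigma_2^2) \;\propto\; \mathcal{N}(x;\mu_*,\sigma_*^2),
\qquad
\frac{1}{\sigma_*^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2},
\qquad
\mu_* = \sigma_*^2\left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right).
$$

Since $\sigma_*^2 \le \min(\sigma_1^2, \sigma_2^2)$, the product is at least as sharp as its sharpest factor; the mixture $w\,\mathcal{N}(x;\mu_1,\sigma_1^2) + (1-w)\,\mathcal{N}(x;\mu_2,\sigma_2^2)$, by contrast, is generally non-Gaussian (possibly bimodal) and never sharper than its components.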
- Bayesian connections with ensembling:
- We can think of a Bayesian predictive distribution as an ensemble: if many different mappings within our hypothesis class are compatible with the data, then we should integrate over all of them in making predictions. The predictive distribution p(y | x, D) is exactly a posterior-weighted average of the individual predictions p(y | x, theta).
- Of course, usually our hypothesis class doesn't contain the true data-generating process. In this case, posterior probability will (asymptotically) concentrate on a single, wrong hypothesis, the one closest to the truth in KL divergence, rather than maintaining a diverse ensemble.
- The same thing happens at the meta level when doing Bayesian model comparison. Given a set of possible models, posterior probability will concentrate on one of them even if an ensemble would be better. See Bayesian model averaging is not model combination.
- Under the 'enlarged hypothesis space' story, the proper way to treat this case is by defining a prior over ensembles, and doing inference in that model. For example, instead of doing Bayesian model comparison between a Gaussian and a Laplace distribution, just fit a mixture of a Gaussian and a Laplace component, or perhaps a product of experts.
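A minimal sketch of this last idea (my code, not the author's; maximum likelihood rather than fully Bayesian, and the data-generating setup is invented for illustration): instead of comparing a Gaussian model against a Laplace model, fit a two-component mixture with one Gaussian and one Laplace component.

```python
# Fit a Gaussian + Laplace mixture by maximum likelihood.
# (Illustrative sketch; a Bayesian treatment would put a prior on these
# parameters and integrate over them rather than optimizing.)
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 700), rng.laplace(3.0, 0.5, 300)])

def neg_log_lik(params):
    # params: logit of mixture weight, Gaussian (mean, log std), Laplace (loc, log scale)
    logit_w, mu_g, log_sg, mu_l, log_bl = params
    w = 1.0 / (1.0 + np.exp(-logit_w))
    dens = (w * stats.norm.pdf(data, mu_g, np.exp(log_sg))
            + (1.0 - w) * stats.laplace.pdf(data, mu_l, np.exp(log_bl)))
    return -np.sum(np.log(dens + 1e-300))

init = [0.0, np.mean(data), np.log(np.std(data)), np.median(data), 0.0]
fit = optimize.minimize(neg_log_lik, init, method="Nelder-Mead")
w_hat = 1.0 / (1.0 + np.exp(-fit.x[0]))
print("mixture weight on the Gaussian component:", w_hat)
```

Rather than forcing all the posterior mass onto one of the two component families, the fitted mixture weight reports how much each contributes.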