
mixture of experts

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A mixture-of-experts model consists of a set of functions $f_1(x), f_2(x), \ldots$, the 'experts', and a gating function $g(x)$ that determines which expert(s) to use for a given input and how to combine their outputs.
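To make the pieces concrete, here is a minimal sketch in plain NumPy. The two experts and the softmax gating function are invented for illustration (all names here are hypothetical); in practice both the experts and the gate would be learned.

```python
import numpy as np

def expert_a(x):
    # hypothetical expert: a linear predictor
    return 2.0 * x + 1.0

def expert_b(x):
    # hypothetical expert: a quadratic predictor
    return x ** 2

def gate(x):
    # hypothetical gating function: softmax over per-expert logits,
    # giving weights that sum to 1 for each input
    logits = np.stack([x, -x], axis=-1)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

def mixture_of_experts(x):
    # combine the expert outputs using the gating weights
    outputs = np.stack([expert_a(x), expert_b(x)], axis=-1)
    return (gate(x) * outputs).sum(axis=-1)

print(mixture_of_experts(np.array([0.0, 1.0, 2.0])))
```

Here the gate produces soft weights over all experts; a sparse variant would instead keep only the top-scoring expert(s) and skip evaluating the rest.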

If the experts produce probabilistic predictions, we would typically combine these by taking the mixture distribution of their individual predictive distributions, with mixture weights determined by the gating network.
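In symbols, assuming the gating function produces normalized weights $g_k(x)$ over experts with predictive distributions $p_k(y \mid x)$, the combined prediction is the mixture

$$p(y \mid x) = \sum_k g_k(x)\, p_k(y \mid x), \qquad g_k(x) \ge 0, \quad \sum_k g_k(x) = 1.$$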

related:

One way to view a mixture of experts is as a kind of inverse of a product-of-experts model. Product-of-experts models occur naturally when we have many models --- many maps of the same territory --- so that each map provides constraints on the territory, and taken together we can attempt to 'decode' the territory as the product of all these constraints. By contrast, the inverse 'encoding' function from the territory to a given map/representation will be different depending on the map we choose, so for a collection of maps we'll have a collection of possible encoding functions, and for any particular task requiring a particular representation we may only need to run one of them (planning a road trip requires a road map, a hiking trip a topo map, etc.). This is exactly what a mixture-of-experts model does.
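Written out in the same notation as above, the contrast is that a product of experts multiplies the constraints (and renormalizes), while a mixture of experts weights or selects among them:

$$p_{\mathrm{PoE}}(y \mid x) \propto \prod_k p_k(y \mid x), \qquad p_{\mathrm{MoE}}(y \mid x) = \sum_k g_k(x)\, p_k(y \mid x).$$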