Created: October 30, 2020
Modified: October 30, 2020
probabilistic transformers
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
- A short note on interpreting a transformer layer as performing maximum-likelihood inference in a Gaussian mixture model: https://arxiv.org/abs/2010.15583
- The correspondence is as follows:
- The mixture 'components' are the units at each layer.
- Each unit $u$ has an associated distribution over queries and values, which is assumed to be a factored Normal: $p(q, v \mid u) = \mathcal{N}(q \mid k_u, \sigma_q^2 I)\,\mathcal{N}(v \mid v_u, \sigma_v^2 I)$, where $k_u$ and $v_u$ are the key and expected value vectors for unit $u$.
- Recall that in a transformer, each unit produces a key, a value, and a query for its counterpart at the next layer.
- Here, each unit has a known key $k_u$, input value $v_u$, and sampled query $q_u$. The output value is unknown.
- This model assumes that the query is randomly drawn from a Normal around the key at each layer---is that reasonable??
- Given a query $q$ where we don't know which unit it came from, we want the most probable value for that query. That means we:
- compute 'weights' for each unit according to how likely it is to have generated $q$.
- take the weighted expectation of the values $v_u$.
- This is pretty much exactly the transformer update equation, with some differences in how the weights are computed: the Gaussian form scores each unit by the squared distance between query and key, while the actual transformer uses dot products. The two are related through the query and key norms; the query norm is shared by every unit and cancels in the softmax, and the key norms can be encoded as per-unit 'priors', making the two forms equivalent (see the derivation and sketch below).
- I'm not super impressed by this. The model and the query feel pretty artificial.
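- To spell out that equivalence (a quick derivation I'm adding; the unit variance is my simplification, not something pinned down above): expanding the squared distance gives $-\tfrac{1}{2}\lVert q - k_u \rVert^2 = q^\top k_u - \tfrac{1}{2}\lVert k_u \rVert^2 - \tfrac{1}{2}\lVert q \rVert^2$. The last term is the same for every unit, so it cancels inside the softmax; what remains is the dot-product score $q^\top k_u$ plus a per-unit offset $-\tfrac{1}{2}\lVert k_u \rVert^2$, which plays the role of a log-prior over units.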
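- A minimal numeric sketch of the correspondence, assuming isotropic unit-variance Gaussians and a uniform prior over units (both are my assumptions for illustration; the paper's exact parameterization may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_units = 4, 5

K = rng.normal(size=(n_units, d))  # keys k_u: Gaussian means over queries
V = rng.normal(size=(n_units, d))  # values v_u: Gaussian means over values
q = rng.normal(size=d)             # an observed query of unknown origin

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# (1) GMM view: responsibility of each unit for having generated q,
#     then the posterior-weighted expectation of the values.
sq_dist_logits = -0.5 * np.sum((q - K) ** 2, axis=1)
resp = softmax(sq_dist_logits)
gmm_output = resp @ V

# (2) Attention view: dot-product scores plus a per-unit 'prior' term
#     -||k_u||^2 / 2 absorbing the key norms; the query norm is constant
#     across units and cancels inside the softmax.
dot_logits = K @ q - 0.5 * np.sum(K ** 2, axis=1)
attn_weights = softmax(dot_logits)
attn_output = attn_weights @ V

print(np.allclose(gmm_output, attn_output))  # True: the two weightings coincide
```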