Modified: October 17, 2022
multiplicative interaction
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
From a conversation I had about attention mechanisms in deep architectures. Maybe that terminology is too suggestive: it's just a case of a multiplicative interaction, and we should focus on the mathematical operations at play.
'Multiplicative structure' is another way of saying that a layer gets to rescale its inputs, and thereby choose which dimensions will be salient to downstream computation. Sure, that's a relaxation of 'hard' attention / pointer networks, which are themselves a gross abstraction of human attention, but it doesn't seem like a crazy metaphor?
Reflecting on this, I still think 'attention' is a good description of what happens in transformers, but multiplicative structure is a broader thing that encompasses all kinds of fast weights. In the classic neural net vocabulary of activations and weights, it makes sense to say that a 'weight' is any quantity that multiplies an activation vector to produce new activations. A traditional 'slow' weight would be constant across all network inputs; relaxing this opens up a wider world of multiplying by input-dependent quantities.
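To make the slow/fast distinction concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from the conversation: the names `W_SLOW`, `T`, `fast_weights`, and `fast_layer` are hypothetical, and the particular weight generator (a fixed 3-way tensor contracted with the input) is just one simple way to get an input-dependent matrix.

```python
def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# 'Slow' weight: one fixed matrix, shared by every input.
W_SLOW = [[1.0, 2.0],
          [0.0, 1.0]]

def slow_layer(x):
    return matvec(W_SLOW, x)

# 'Fast' weight: the matrix is itself computed from the input.
# Hypothetical generator: W(x)[i][j] = sum_k T[i][j][k] * x[k].
T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

def fast_weights(x):
    return [[sum(t * v for t, v in zip(T[i][j], x)) for j in range(len(x))]
            for i in range(len(T))]

def fast_layer(x):
    # The input appears twice: once to build the weights, once as the activation.
    return matvec(fast_weights(x), x)

print(slow_layer([1.0, 2.0]), slow_layer([2.0, 4.0]))
print(fast_layer([1.0, 2.0]), fast_layer([2.0, 4.0]))
```

Doubling the input doubles the output of `slow_layer`, but quadruples the output of `fast_layer`, because the input enters multiplicatively twice. That product of input with input is the multiplicative interaction.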
What kinds of mechanisms exist for input-dependent weights?
- Transformer-like 'attention' mechanisms:
  - operate in the position axis of a sequence model
  - 'weights' for each output are simplex-constrained (nonnegative, summing to one)
  - 'weights' arise from pairwise products of inputs (a softmax over query-key dot products)
- LSTM gates (forget, input, output)
  - pointwise multiplication, equivalent to a diagonal weight matrix
  - weights are in [0, 1]
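The two mechanisms in the list can be sketched side by side. This is a simplified illustration under stated assumptions, not a faithful transformer or LSTM: queries and keys are the raw inputs with no learned projections, the gate pre-activation is the input itself rather than a learned affine map of input and hidden state, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(xs):
    """Row t holds the weights that output position t places on every position.
    Assumption: queries and keys are the raw inputs (no learned projections)."""
    scale = math.sqrt(len(xs[0]))
    return [softmax([dot(q, k) / scale for k in xs]) for q in xs]

def attend(xs):
    """Mix the sequence along the position axis using input-dependent weights."""
    A = attention_weights(xs)
    dim = len(xs[0])
    return [[sum(a * x[d] for a, x in zip(row, xs)) for d in range(dim)]
            for row in A]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(x, h):
    """LSTM-style gate: rescale each coordinate of h by a value in [0, 1].
    Assumption: the gate pre-activation is x itself, not a learned affine map."""
    g = [sigmoid(v) for v in x]
    # Same result as multiplying h by the diagonal matrix diag(g).
    return [gi * hi for gi, hi in zip(g, h)]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention_weights(xs):
    print(row, sum(row))
print(gate([0.0, 0.0], [2.0, 4.0]))
```

Each row of the attention matrix lies on the simplex and mixes across positions, while the gate acts within a single position and rescales each feature dimension independently.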
If we think of weights as the 'program' of a network, then input-dependent weights allow networks to run input-dependent programs. For a network to itself be a programmable computer seems significant. But how?
This paper seems very relevant:
Multiplicative Interactions and Where to Find Them https://openreview.net/forum?id=rylnK6VtDH