Modified: December 07, 2023
linear attention
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
The usual transformer attention mechanism is written as

$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$

or equivalently

$$o_i = \sum_j A_{ij} v_j,$$

where

$$A_{ij} = \frac{\exp\!\left(q_i^\top k_j / \sqrt{d_k}\right)}{\sum_{j'} \exp\!\left(q_i^\top k_{j'} / \sqrt{d_k}\right)}$$

is the matrix of normalized attention scores.
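As a concrete reference point, here is a minimal NumPy sketch of the standard mechanism above (single head, no masking; the function name is my own):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K: [seq, d_k], V: [seq, d_v]. Returns [seq, d_v].
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # [seq, seq] raw scores
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)     # row-normalize: each A[i] sums to 1
    return A @ V                           # o_i = sum_j A_ij v_j
```

Note that each row of $A$ is a probability distribution over key positions, which is the property the similarity-function generalization below has to preserve via its normalizer.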
Mechanically, nothing is stopping us from replacing these scores with any positive similarity function $\operatorname{sim}(q_i, k_j)$:

$$o_i = \frac{\sum_j \operatorname{sim}(q_i, k_j)\, v_j}{\sum_{j'} \operatorname{sim}(q_i, k_{j'})}.$$

Linear attention is the case where we choose $\operatorname{sim}$ to be linear in (features of) $q_i$ and $k_j$:

$$\operatorname{sim}(q_i, k_j) = \phi(q_i)^\top \phi(k_j),$$

so that

$$o_i = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j'} \phi(k_{j'})}.$$
The advantage of doing this is that it allows attention output values to be computed recurrently, accumulating the key-value outer product matrix across steps. The disadvantage is that we give up some expressivity relative to traditional attention.
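The recurrent formulation can be sketched as follows: carry a state matrix $S = \sum_j \phi(k_j) v_j^\top$ and a normalizer vector $z = \sum_j \phi(k_j)$, updating both once per step. The feature map here is $\phi(x) = \operatorname{elu}(x) + 1$, a common choice but an assumption on my part since the notes leave $\phi$ unspecified:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map (assumed; any positive phi works).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention computed as an RNN.

    Q, K: [seq, d_k], V: [seq, d_v]. Returns [seq, d_v].
    """
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running key-value outer-product state
    z = np.zeros(d_k)          # running normalizer state
    out = []
    for q, k, v in zip(Q, K, V):
        qf, kf = phi(q), phi(k)
        S += np.outer(kf, v)                  # accumulate phi(k_j) v_j^T
        z += kf                               # accumulate phi(k_j)
        out.append((qf @ S) / (qf @ z))       # read out for this step's query
    return np.stack(out)
```

Per step this costs O(d_k * d_v) time and memory, independent of sequence length, which is the whole point: the quadratic [seq, seq] score matrix never materializes.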
How should we understand the loss of expressivity?
Can we do multi-query attention with linear attention? At the final position, for each head, we end up with a state $S = \sum_j \phi(k_j) v_j^\top$ of shape [key_size, value_size]. Then with multiple queries we consider a query matrix of shape [num_queries, key_size], and each query can read out from that same shared state.
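This readout seems to work out mechanically: the state is query-independent, so a [num_queries, key_size] matrix of feature-mapped queries multiplies into it directly. A sketch (function name and argument conventions are my own):

```python
import numpy as np

def multi_query_readout(Qf, S, z):
    """Read several queries against one accumulated linear-attention state.

    Qf: [num_queries, key_size], already feature-mapped (phi applied).
    S:  [key_size, value_size], accumulated sum_j phi(k_j) v_j^T.
    z:  [key_size], accumulated sum_j phi(k_j) (the normalizer).
    Returns [num_queries, value_size].
    """
    return (Qf @ S) / (Qf @ z)[:, None]
```

Each row of the result is exactly what a single-query readout at the final position would produce, so the per-head state is shared across queries for free.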