Created: February 12, 2023
Modified: February 13, 2023
transformer primitives
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
In developing intuition about transformers, it's useful to think about specific primitive operations that can be implemented by a small number of layers.
In most cases these constructions assume a particular input format.
Selection
The selection operation y = where(c, a, b) returns a when the condition c is satisfied, and otherwise returns b. This operation differs from a conditional branch in that it assumes that both a and b are already computed.
This can be implemented in a single feedforward layer by relu selection.
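A minimal sketch of one way this could work, not necessarily the exact construction meant here. It assumes c is exactly 0 or 1 and that |a| and |b| are bounded by a known constant M. The idea is to subtract M from whichever branch is not selected, so that ReLU clips it to zero, and to represent each signed value as relu(v) - relu(-v). The names M, W1, B1, W2 and where_ffn are made up for illustration.

```python
# One feedforward layer: y = W2 . relu(W1 . x + B1), with input x = [c, a, b].
M = 100.0  # assumed bound: |a| <= M and |b| <= M

# Hidden units (before ReLU):
#   h1 =  a - M*(1 - c)    h2 = -a - M*(1 - c)   -> survive only when c = 1
#   h3 =  b - M*c          h4 = -b - M*c         -> survive only when c = 0
W1 = [
    [ M,  1.0,  0.0],
    [ M, -1.0,  0.0],
    [-M,  0.0,  1.0],
    [-M,  0.0, -1.0],
]
B1 = [-M, -M, 0.0, 0.0]

# Output recombines positive and negative parts: (h1 - h2) + (h3 - h4).
W2 = [1.0, -1.0, 1.0, -1.0]


def relu(v):
    return max(0.0, v)


def where_ffn(c, a, b):
    """Compute where(c, a, b) with a single linear -> ReLU -> linear layer."""
    x = [float(c), float(a), float(b)]
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W1, B1)
    ]
    return sum(w * h for w, h in zip(W2, hidden))


print(where_ffn(1, 3.0, -7.0))  # selects a
print(where_ffn(0, 3.0, -7.0))  # selects b
```

This costs four hidden units per selected scalar. If c is only approximately 0 or 1, or the values exceed M, the output is no longer an exact copy of a or b.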
Read / write
The paper Looped Transformers as Programmable Computers presents one-layer constructions:
read: given a positional encoding (a 'memory index'), copies the value from that point in the sequence to a scratchpad at a given position.
write: copies the value at the scratchpad to some given position.
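To make the read / write idea concrete, here is a rough sketch using sharp softmax attention over positional encodings. This is a simplified illustration and not the paper's exact construction. The ±1 binary encodings, the sharpness constant, the fixed "keep my own value" score in write, and all function names are assumptions made for this example.

```python
import math


def pos_encoding(i, bits):
    # Assumed encoding: index i as a vector of +1/-1 bits.
    return [1.0 if (i >> k) & 1 else -1.0 for k in range(bits)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(memory, index, bits, sharpness=50.0):
    """Scratchpad attends over memory; the query is the pointer's encoding."""
    pointer = pos_encoding(index, bits)
    scores = [sharpness * dot(pointer, pos_encoding(j, bits))
              for j in range(len(memory))]
    weights = softmax(scores)  # close to one-hot at `index`
    return sum(w * v for w, v in zip(weights, memory))


def write(memory, index, value, bits, sharpness=50.0):
    """Each position attends over {scratchpad, itself}."""
    pointer = pos_encoding(index, bits)
    updated = []
    for j, old in enumerate(memory):
        match = dot(pos_encoding(j, bits), pointer)  # equals bits only if j == index
        # The scratchpad wins only on an exact match; otherwise keep the old value.
        w_new, w_old = softmax([sharpness * match, sharpness * (bits - 1)])
        updated.append(w_new * value + w_old * old)
    return updated


memory = [10.0, 20.0, 30.0, 40.0]
print(read(memory, 2, bits=2))
print(write(memory, 1, 99.0, bits=2))
```

The sketch relies on the fact that two distinct ±1 encodings differ in at least one bit, so their dot product is at least 2 below the self-match score. A large sharpness then makes the softmax behave almost like a hard lookup, though the copy is only approximate.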