transformer primitives: Nonlinear Function
Created: February 12, 2023
Modified: February 13, 2023

transformer primitives

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

In developing intuition about transformers it's useful to think about specific primitive operations that can be implemented by a small number of layers.

In most cases these constructions assume a particular input format.

Selection

The selection operation y = where(c, a, b) returns a when the condition c is satisfied, and otherwise returns b. It differs from a conditional branch in that it assumes both a and b have already been computed.

This can be implemented in a single feedforward layer using ReLU selection.
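
A minimal sketch of one way to do this, not necessarily the exact construction meant above. It assumes c is exactly 0 or 1 and that |a| and |b| are below a known bound; the function name relu_select and the bound parameter are made up for illustration. Four hidden ReLU units followed by a fixed linear readout play the role of the feedforward layer.

```python
def relu(x):
    return max(0.0, x)

def relu_select(c, a, b, bound=100.0):
    """where(c, a, b), assuming c is 0 or 1 and |a|, |b| <= bound (assumptions)."""
    # Hidden layer: four ReLU units, each an affine function of (c, a, b).
    # The large offset pushes the unselected branch below zero, so ReLU zeroes it.
    h = [
        relu(a - bound * (1 - c)),   # positive part of a, alive only when c = 1
        relu(-a - bound * (1 - c)),  # negative part of a, alive only when c = 1
        relu(b - bound * c),         # positive part of b, alive only when c = 0
        relu(-b - bound * c),        # negative part of b, alive only when c = 0
    ]
    # Output layer: fixed linear readout that recombines positive and negative parts.
    return h[0] - h[1] + h[2] - h[3]

selected_when_true = relu_select(1, 3.0, -7.0)
selected_when_false = relu_select(0, 3.0, -7.0)
```

The split into positive and negative parts is needed because a single ReLU can only pass non-negative values. If c is only approximately 0 or 1, the output is only approximately right.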

Read / write

The paper Looped Transformers as Programmable Computers presents one-layer constructions for two memory operations:

  1. read: given a positional encoding p_i (a 'memory index'), copy the value at that position in the sequence to a scratchpad at position p_1.
  2. write: copy the value in the scratchpad to some position p_i.
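
To make the two operations concrete, here is a rough sketch of read and write as attention with a sharp softmax. This is my own simplification, not the paper's construction: it assumes one-hot positional encodings, scalar values, a scratchpad held separately from the memory, and a hand-picked temperature. All function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def read(values, pointer, positions, temperature=50.0):
    """Approximately return the value stored at the position matching pointer."""
    # Query = pointer held at the scratchpad; keys = positional encodings.
    # A large temperature makes the softmax nearly one-hot on the match.
    weights = softmax([temperature * dot(pointer, p) for p in positions])
    return sum(w * v for w, v in zip(weights, values))

def write(values, pointer, positions, scratch_value, temperature=50.0):
    """Approximately replace the pointed-to value with scratch_value."""
    out = []
    for p, v in zip(positions, values):
        # Each position attends over two candidates: the scratchpad (high score
        # only if this position matches the pointer) and itself (fixed middling
        # score), so every non-target position keeps its own value.
        w_scratch, w_self = softmax(
            [temperature * dot(p, pointer), temperature * 0.5]
        )
        out.append(w_scratch * scratch_value + w_self * v)
    return out

n = 4
positions = [one_hot(j, n) for j in range(n)]
memory = [10.0, 20.0, 30.0, 40.0]
pointer = one_hot(2, n)  # memory index pointing at the third slot

scratch = read(memory, pointer, positions)
updated = write(memory, pointer, positions, scratch_value=99.0)
```

Both results are only approximate because softmax never reaches an exact one-hot; raising the temperature shrinks the error.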