Some ideas from this paper. Binary positional encodings. Memory index is represented by a binary vector. This has the property that…
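A minimal sketch of one way to realize this, assuming the index is simply mapped to its binary expansion, one dimension per bit, rescaled to ±1 (the bit width and the ±1 convention are assumptions, not taken from the note):

```python
import numpy as np

def binary_positional_encoding(index: int, num_bits: int = 16) -> np.ndarray:
    """Encode a memory index as its binary expansion, one dimension per bit.

    Hypothetical sketch: bit i of `index` goes in dimension i, mapped from
    {0, 1} to {-1, +1}.
    """
    bits = (index >> np.arange(num_bits)) & 1   # little-endian bit extraction
    return 2.0 * bits - 1.0                      # map {0, 1} -> {-1, +1}

# Each index gets a distinct length-num_bits vector with entries in {-1, +1}.
print(binary_positional_encoding(5, num_bits=4))   # [ 1. -1.  1. -1.]
print(binary_positional_encoding(7, num_bits=4))   # [ 1.  1.  1. -1.]
```

One convenient consequence of any scheme like this is that the encoding width grows only logarithmically with the number of addressable positions.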
This note is a scratchpad for investigating the expressivity of the [transformer] architecture. In general, one set of intuitions that we…
There are a few ways to do this. Google's PaLM uses rotary embeddings, so it seems like that's probably close to the state of the art? But…
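For reference, a minimal sketch of the rotary scheme, assuming the standard formulation: consecutive dimension pairs of each query/key vector are rotated by position-dependent angles with frequencies base^(-2i/d).

```python
import numpy as np

def apply_rotary(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of a query/key vector by
    position-dependent angles (the usual rotary-embedding recipe).

    x: vector of even length d; pair (2i, 2i+1) is rotated by the angle
    position * base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = apply_rotary(np.ones(8), position=3)
k = apply_rotary(np.ones(8), position=7)
```

Because the same rotation family is applied to both queries and keys, their dot products end up depending only on the relative offset between positions rather than on absolute position.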
Suppose we want a [transformer] to evaluate the inequality returning if and otherwise. For integer , this can be done with a…
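A minimal sketch of one such construction, assuming the inequality has the form x ≥ c for an integer input x and integer threshold c (both assumptions, not taken from the note): a difference of two ReLUs computes the exact 0/1 indicator, so a feed-forward layer with two hidden units suffices.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def indicator_geq(x: np.ndarray, c: int) -> np.ndarray:
    """Exact indicator 1[x >= c] for integer-valued x, using two ReLUs.

    For integer x: relu(x - c + 1) - relu(x - c) equals 1 when x >= c
    (the two terms differ by exactly 1) and 0 when x <= c - 1 (both vanish).
    """
    return relu(x - c + 1) - relu(x - c)

xs = np.arange(-2, 5)
print(indicator_geq(xs, c=2))   # [0. 0. 0. 0. 1. 1. 1.]
```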
References: Jacobs, Jordan, Nowlan, and Hinton, "Adaptive Mixtures of Local Experts" (1991); Shazeer et al., "Outrageously Large Neural Networks…
How should a machine learning model represent text? Word-level and character-level features are obvious options, but both have drawbacks…
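A toy comparison of the two obvious options, using plain Python string operations rather than a real tokenizer:

```python
text = "transformers are surprisingly expressive"

# Word-level: short sequences, but the vocabulary is huge and open-ended,
# and rare or misspelled words fall outside it entirely.
word_tokens = text.split()   # ['transformers', 'are', 'surprisingly', 'expressive']

# Character-level: tiny closed vocabulary with no out-of-vocabulary problem,
# but sequences get long and each token carries little meaning on its own.
char_tokens = list(text)     # ['t', 'r', 'a', 'n', ...]

print(len(word_tokens), len(char_tokens))   # 4 40
```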
In developing intuition about [transformer]s it's useful to think about specific primitive operations that can be implemented by a small…
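One concrete example of such a primitive, chosen here as an illustration rather than taken from the note: a single attention head that copies the value stored at the previous position, using one-hot positional codes as keys and a shifted, scaled copy of them as queries so the softmax is effectively one-hot.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical primitive: "copy the previous token's value."
n = 6
pos = np.eye(n)                               # one-hot positional encodings
values = np.arange(n, dtype=float)[:, None]   # one "token value" per position

scale = 20.0
queries = scale * np.roll(pos, 1, axis=0)     # position i asks for position i - 1
keys = pos

attn = softmax(queries @ keys.T, axis=-1)     # ~one-hot on the previous position
out = attn @ values
print(out.ravel().round(2))   # ~[5. 0. 1. 2. 3. 4.]; position 0 wraps around in this toy
```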
The core of the transformer architecture is multi-headed [attention]. The transformer block consists of a multi-headed attention layer…
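A compact sketch of the multi-headed attention sublayer on its own (dimensions arbitrary; masking, biases, layer norm, and the feed-forward sublayer are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product attention with num_heads heads.

    x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model).
    Each head attends in a d_model // num_heads dimensional subspace;
    the heads' outputs are concatenated and mixed by Wo.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(h):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return h.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                       # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 16, 5, 4
x = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
y = multi_head_attention(x, *W, num_heads=num_heads)
print(y.shape)   # (5, 16)
```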