Modified: September 25, 2023
perceiver
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Reading the Perceiver papers from DeepMind:
- Perceiver: Jaegle et al 2021 https://arxiv.org/abs/2103.03206
- Perceiver-IO: Jaegle et al 2022 https://arxiv.org/abs/2107.14795
- Perceiver AR: Hawthorne et al 2022 https://arxiv.org/abs/2202.07765
The basic architecture is a latent 'residual stream' that can attend to parts of the input in turn. It interleaves cross-attention modules --- where the 'global' residual generates a query, matched against keys and values generated from an input byte array --- with a standard transformer block on the latents.
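Roughly, the forward pass interleaves like this (a sketch, not the paper's code; `cross_attend` and `latent_block` are placeholder callables standing in for learned attention modules, and their signatures are made up):

```python
def perceiver_forward(inputs, latent, cross_attends, latent_blocks):
    # inputs: the (large) input byte array, e.g. flattened image pixels
    # latent: a small array playing the role of the residual stream
    for cross_attend, latent_block in zip(cross_attends, latent_blocks):
        # latent queries attend to keys/values computed from the input
        latent = latent + cross_attend(q_from=latent, kv_from=inputs)
        # then a standard transformer block runs purely in latent space
        latent = latent_block(latent)
    return latent
```

Repeating the same modules in those lists gives the weight-shared (RNN-ish) variant.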
I think this is, roughly, an 'attention RNN' (if weights are shared between steps, which they can be but don't have to be). In an RNN we just feed in the input at each step. An LSTM steps this up by using gates to determine what parts of the input and residual to keep (?). The perceiver uses QKV attention to bring in part of the input at each step. And the latent steps are also (attention-based) transformers.
What is the shape of these computations?
- input byte array is M × C. For low-res ImageNet, M is the number of pixels, and C is the number of channels. I imagine this would be something like 3+16 channels, for something like RGB plus dimensions of Fourier positional embeddings?
- latent array is N × D (N latent indices, each of dimension D).
- so we get a key and value from each pixel in the image, and a query for each 'index' of the latent? So each latent index will create an image-shaped attention map. The sketch below walks through the shapes.
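A shape-level sketch in numpy; the sizes and projection matrices here are made up (not the paper's hyperparameters), just to see what attends to what:

```python
import numpy as np

# Made-up sizes, chosen only to make the shapes concrete.
M, C = 4096, 19   # input byte array: e.g. 64x64 = 4096 pixels, 3 + 16 channels
N, D = 512, 256   # latent array: N indices, each of dimension D

x = np.random.randn(M, C)        # input byte array
z = np.random.randn(N, D)        # latent array

# Random stand-ins for learned projections.
W_q = np.random.randn(D, D)      # latent -> queries
W_k = np.random.randn(C, D)      # input  -> keys
W_v = np.random.randn(C, D)      # input  -> values

Q = z @ W_q                      # (N, D): one query per latent index
K = x @ W_k                      # (M, D): one key per pixel
V = x @ W_v                      # (M, D): one value per pixel

# Attention map: one image-shaped (length-M) distribution per latent index.
scores = Q @ K.T / np.sqrt(D)    # (N, M)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)

z_new = A @ V                    # (N, D): updated latent, same shape as before
print(A.shape, z_new.shape)      # (512, 4096) (512, 256)
```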
They end up with eight steps of cross-attention, i.e., the model 'looks at' the image eight times. Weights are shared for all but the first step of cross-attention and latent attention.
Perceiver IO generalizes this to create arbitrary outputs via attention. Each element of the output generates a query to the latent, which provides keys and values.
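Same idea at the shape level (the output count O, query dimension E, and projections are made-up placeholders; the output queries would really be built from output positions or task embeddings):

```python
import numpy as np

N, D = 512, 256    # latent array, as above
O, E = 2048, 64    # O output elements (e.g. one per output token or pixel)

z = np.random.randn(N, D)        # latent after the processing stack
q_out = np.random.randn(O, E)    # output query array

W_q = np.random.randn(E, D)      # output query -> query
W_k = np.random.randn(D, D)      # latent -> keys
W_v = np.random.randn(D, D)      # latent -> values

Q = q_out @ W_q                  # (O, D)
K = z @ W_k                      # (N, D)
V = z @ W_v                      # (N, D)

scores = Q @ K.T / np.sqrt(D)    # (O, N): each output element attends over the latent
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)
outputs = A @ V                  # (O, D): one vector per output element
print(outputs.shape)
```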
One way to think about Perceiver is that it's like a transformer, except instead of requiring the residual stream to have the same length as the input (and be initialized with the input), we make it latent so that it can be much smaller (far fewer positions). Then we initialize it with some function of the input (the first set of cross-attention weights, combined with the initial learned latent). But we also allow it to 'look back' at the input at several points.
And the latent operates at the level of potentially a whole sentence rather than individual tokens, so in principle we can represent an arbitrarily long sequence of words with a fixed-size latent.
Perceiver AR is interesting because it's in the same family conceptually, but it's really just a normal transformer on a length-N sequence that's allowed to look at a much larger length-M input sequence at the first layer. This is 'linear' in the length of the larger context, as long as it's a given prompt (versus actually generating a sequence of length M, which would pay M work at each step, so M^2 total, but still better than the M^3 of a standard transformer that pays M^2 attention at each step).
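A rough sketch of that first layer's shapes (sizes made up; learned projections, positional embeddings, and the causal masking of the cross-attend are omitted):

```python
import numpy as np

M, N, D = 8192, 1024, 512   # long context of length M; latents over the last N positions

x = np.random.randn(M, D)   # embedded input sequence (prompt + recent tokens)

# First layer: the last N positions cross-attend to the whole length-M context.
Q = x[-N:]                  # (N, D) queries from the last N positions
K, V = x, x                 # (M, D) keys/values from the full context (projections omitted)

scores = Q @ K.T / np.sqrt(D)                       # (N, M): linear in M for a fixed prompt
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)
z = A @ V                                           # (N, D) latents

# After this single cross-attend, it's a normal causal transformer on the N latents,
# so per-layer cost no longer depends on M.
print(z.shape)
```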
The AR and IO Perceivers are kind of unsatisfying as a 'global workspace' because they don't even use iterative cross-attention. The input is encoded once, at the beginning, and then it's just a transformer in latent space.