vision transformer: Nonlinear Function
Created:
Modified:

vision transformer

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Ref:

We start by chunking an image into patches and flattening each patch into a vector of pixel values. Each patch vector is mapped through a trainable (linear) embedding into a higher-dimensional token space, and a position embedding is added to the result. Say this gives us 256 patch vectors; these are essentially 'raw image tokens'.
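A minimal sketch of this step in PyTorch, assuming made-up sizes (a 256×256 image with 16×16 patches, i.e. 256 patches, and a 768-dimensional token space); the strided convolution is just the usual shortcut for "flatten each patch and apply a shared linear layer":

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and project each to a token vector."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2      # 256 patches here
        # Equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # Learned position embedding, one vector per patch position
        # (a real implementation would use a small random init, not zeros).
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                      # x: (B, 3, 256, 256)
        x = self.proj(x)                       # (B, dim, 16, 16)
        x = x.flatten(2).transpose(1, 2)       # (B, 256, dim) -- raw image tokens
        return x + self.pos_embed
```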

These tokens now become the input to a transformer encoder, which eventually produces 256 output vectors representing the image. These are 'processed image tokens'.
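Continuing the sketch above (same imports and the hypothetical PatchEmbed class), here is one way to run the raw tokens through a stack of standard encoder blocks, using PyTorch's built-in encoder as a stand-in for the ViT block and roughly ViT-Base-sized hyperparameters (12 layers, 12 heads, 768-dim, 3072-dim MLP):

```python
# norm_first=True gives the pre-norm arrangement used in ViT.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

patch_embed = PatchEmbed()
imgs = torch.randn(2, 3, 256, 256)            # dummy batch of 2 images
tokens = patch_embed(imgs)                    # (2, 256, 768) raw image tokens
processed = encoder(tokens)                   # (2, 256, 768) processed image tokens
```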

At least in the original vision transformer paper, the position embeddings were learned 1D embeddings, so the 2D grid structure of the image is not built into the model at all.

For the original supervised setup, we prepend another learnable token (the [class] token) to the patch tokens fed into the network, so there are really 257 tokens in and out. The encoder output at this position is fed into a classification head that produces the image label, and the training signal flows backwards from this head through the whole network.
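To make that concrete, a hedged sketch reusing the hypothetical PatchEmbed above; the sizes (num_classes=1000, depth=12, etc.) are placeholders, and for brevity the [class] token here does not get its own position embedding and the head is a single linear layer (in the paper the [class] token is also position-embedded, and the head is an MLP during pre-training):

```python
class ViTClassifier(nn.Module):
    """Supervised setup: prepend a learnable [class] token and read the label off that position."""
    def __init__(self, dim=768, num_classes=1000, depth=12, heads=12):
        super().__init__()
        self.patch_embed = PatchEmbed(dim=dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # the extra learnable token
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)                 # classifier head

    def forward(self, x):
        tokens = self.patch_embed(x)                            # (B, 256, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)         # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1)                # (B, 257, dim)
        out = self.encoder(tokens)                              # (B, 257, dim)
        return self.head(out[:, 0])                             # logits from the [class] position
```

Training then amounts to computing a cross-entropy loss on these logits against the image labels and backpropagating, which is the sense in which the training signal flows backwards from the classifier head.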