Transformer Papers: Nonlinear Function
Created: May 02, 2021
Modified: January 24, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Michael Nielsen asks for references to understand transformers: https://treeverse.app/view/hlx59HYB. Interesting replies:

  • this NeurIPS 2021 paper: https://proceedings.neurips.cc/paper/2021/hash/d0921d442ee91b896ad95059d13df618-Abstract.html
  • Chapter 9 of Jurafsky and Martin's Speech and Language Processing, on deep sequence-to-sequence models
  • @gwern: "You're looking for a "You Could Have Invented Transformers", but there's not really any such thing. I think in large part because most of what makes Transformers Transformers is not that important or useful, and that's why MLP-Mixers et al (much easier to understand!) work."
  • @moultano: "I think the best way to think about it is "How would I implement a differentiable hash table that uses dot products to do lookups?" And that's basically a layer of a transformer." (sketched in code below)
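
A minimal numpy sketch of @moultano's framing (my own illustration, not code from the thread; the function names, shapes, and seed are arbitrary): dot products score the query against each stored key, softmax turns those scores into soft bucket weights, and the output is the correspondingly weighted blend of the stored values.

    import numpy as np

    def softmax(x):
        # numerically stable softmax over the last axis
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def attention_lookup(query, keys, values):
        # "differentiable hash table": dot-product similarity scores each
        # slot, softmax turns the scores into soft bucket weights, and the
        # result is the weighted blend of the stored values
        scores = query @ keys.T / np.sqrt(keys.shape[-1])
        return softmax(scores) @ values

    # toy usage: 4 key/value slots of dimension 8
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(4, 8))
    values = rng.normal(size=(4, 8))
    query = keys[2] + 0.1 * rng.normal(size=8)  # query close to key 2
    out = attention_lookup(query, keys, values)
    # `out` is dominated by values[2], but every slot contributes a little;
    # that softness is what keeps the lookup differentiable end to end

A full transformer layer runs this lookup for every position at once, with the queries, keys, and values all projected from the same sequence, plus multiple heads and an MLP, but the dot-product lookup is the core mechanism.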