Created: May 02, 2021
Modified: January 24, 2022
Transformer Papers
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
- Massive list here: https://github.com/cedrickchee/awesome-bert-nlp
- Bahdanau, Cho, Bengio. Neural Machine Translation by Jointly Learning to Align and Translate (ICLR 2015). https://arxiv.org/abs/1409.0473.
- The original attention paper.
- Vaswani et al. Attention is All You Need (NeurIPS 2017). https://arxiv.org/abs/1706.03762
- http://jalammar.github.io/illustrated-transformer/
- http://www.peterbloem.nl/blog/transformers
- http://jalammar.github.io/illustrated-bert/
- Vision transformers:
- Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020). https://arxiv.org/abs/2010.11929
- Caron et al. Emerging Properties in Self-Supervised Vision Transformers (2021) https://arxiv.org/abs/2104.14294
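The Dosovitskiy title more or less states the recipe: cut the image into fixed-size patches, flatten each patch, and linearly project it so the patch sequence can be fed to a standard transformer. A minimal numpy sketch of that patchification step; the 224x224 input, 16x16 patch size, and random projection matrix are illustrative stand-ins, not the paper's trained weights:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened patch vectors: (num_patches, patch*patch*C)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)     # (H/p, W/p, p, p, C): one block per grid cell
    return patches.reshape(-1, patch * patch * C)  # flatten each patch into a "word"

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
tokens = patchify(img)                   # (196, 768): 14x14 patches of 16*16*3 values each
W_embed = rng.normal(size=(768, 512))    # stand-in for the learned linear projection
patch_embeddings = tokens @ W_embed      # (196, 512): the token sequence a ViT would consume
print(tokens.shape, patch_embeddings.shape)
```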
Michael Nielsen asks for references to understand transformers: https://treeverse.app/view/hlx59HYB. Interesting replies:
- this NeurIPS 2021 paper: https://proceedings.neurips.cc/paper/2021/hash/d0921d442ee91b896ad95059d13df618-Abstract.html
- Chapter 9 of Jurafsky & Martin's Speech and Language Processing, on deep seq2seq models
- @gwern: "You're looking for a "You Could Have Invented Transformers", but there's not really any such thing. I think in large part because most of what makes Transformers Transformers is not that important or useful, and that's why MLP-Mixers et al (much easier to understand!) work."
- @moultano: "I think the best way to think about it is "How would I implement a differentiable hash table that uses dot products to do lookups?" And that's basically a layer of a transformer."
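@moultano's framing translates almost directly into code. A rough numpy sketch, not anyone's actual implementation: keys act as the table's addresses, values as its contents, and the softmax over dot products gives a soft, differentiable version of picking a bucket (with the 1/sqrt(d) scaling from the Vaswani paper). All shapes and projection names below are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_lookup(queries, keys, values):
    """Differentiable 'hash table' lookup.

    queries: (n_q, d)    what we want to look up
    keys:    (n_kv, d)   the table's addresses
    values:  (n_kv, d_v) the table's stored contents
    Each query gets back a softmax-weighted blend of the values.
    """
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # dot-product match scores
    weights = softmax(scores, axis=-1)                    # soft, differentiable bucket choice
    return weights @ values

# In a self-attention layer, queries, keys, and values are all linear
# projections of the same token embeddings (names here are made up).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 32))            # 5 tokens, 32-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(32, 16)) for _ in range(3))
out = attention_lookup(tokens @ W_q, tokens @ W_k, tokens @ W_v)
print(out.shape)                             # (5, 16)
```

A full transformer layer then wraps this lookup in multiple heads, a feed-forward block, residual connections, and layer norm, but the lookup above is the core of the analogy.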