Transformer Papers: Nonlinear Function
Created: May 02, 2021
Modified: January 24, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Michael Nielsen asks for references to understand transformers: https://treeverse.app/view/hlx59HYB. Interesting replies:

  • this NeurIPS 2021 paper: https://proceedings.neurips.cc/paper/2021/hash/d0921d442ee91b896ad95059d13df618-Abstract.html
  • Chapter 9 of Jurafsky and Martin's Speech and Language Processing, on deep sequence-to-sequence models
  • @gwern: "You're looking for a "You Could Have Invented Transformers", but there's not really any such thing. I think in large part because most of what makes Transformers Transformers is not that important or useful, and that's why MLP-Mixers et al (much easier to understand!) work."
  • @moultano: "I think the best way to think about it is "How would I implement a differentiable hash table that uses dot products to do lookups?" And that's basically a layer of a transformer." (sketched in code below)
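
A minimal numpy sketch of @moultano's framing (my own illustration, not code from the thread; the function names, shapes, and seed are arbitrary): dot products score the query against each stored key, softmax turns those scores into soft bucket weights, and the output is the correspondingly weighted blend of the stored values.

    import numpy as np

    def softmax(x):
        # numerically stable softmax over the last axis
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def attention_lookup(query, keys, values):
        # "differentiable hash table": dot-product similarity scores each
        # slot, softmax turns the scores into soft bucket weights, and the
        # result is the weighted blend of the stored values
        scores = query @ keys.T / np.sqrt(keys.shape[-1])
        return softmax(scores) @ values

    # toy usage: 4 key/value slots of dimension 8
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(4, 8))
    values = rng.normal(size=(4, 8))
    query = keys[2] + 0.1 * rng.normal(size=8)  # query close to key 2
    out = attention_lookup(query, keys, values)
    # `out` is dominated by values[2], but every slot contributes a little;
    # that softness is what keeps the lookup differentiable end to end

A full transformer layer runs this lookup for every position at once, with the queries, keys, and values all projected from the same sequence, plus multiple heads and an MLP, but the dot-product lookup is the core mechanism.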