gate: Nonlinear Function
Created: December 14, 2023
Modified: December 20, 2023

gate

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Examples recommended by GPT-4:

  1. Long Short-Term Memory (LSTM):

    • Paper: "Long Short-Term Memory" by Sepp Hochreiter and Jürgen Schmidhuber (1997).
    • Significance: This seminal paper introduced the LSTM, a recurrent neural network (RNN) architecture specifically designed to address the vanishing gradient problem in standard RNNs, using gates to control the flow of information (a minimal cell sketch appears after this list).
  2. Gated Recurrent Units (GRU):

    • Paper: "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" by Kyunghyun Cho et al. (2014).
    • Significance: This paper presented the GRU, a simpler variant of the LSTM. GRUs reduce the gating mechanism to two gates (reset and update) and have been shown to perform comparably on many tasks, often with faster training (a sketch appears after this list).
  3. Gated Convolutional Networks:

    • Paper: "Conditional Image Generation with PixelCNN Decoders" by Aäron van den Oord et al. (2016).
    • Significance: This work introduced gated convolutional layers in the context of generative models. The gating mechanism helped convolutional neural networks (CNNs) model dependencies in the data more effectively (a sketch of a gated convolution appears after this list).
    • Paper: "Language Modeling with Gated Convolutional Networks" by Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier (2016). https://arxiv.org/abs/1612.08083v3
  4. Attention Mechanisms:

    • Paper: "Attention Is All You Need" by Ashish Vaswani et al. (2017).
    • Significance: While self-attention is not a gating mechanism in the traditional sense, the Transformer architecture built on it has been a pivotal moment in the evolution of neural networks and has significantly influenced subsequent models in various domains.
  5. Gated Multilayer Perceptrons (MLPs):

    • Paper: "Highway Networks" by Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber (2015).
    • Significance: This paper introduced highway networks, where gated units enable the training of very deep networks by adaptively carrying information across layers (a sketch appears after this list). It laid the groundwork for more complex gated MLP structures.
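
To make the gating idea concrete, here is a minimal single-step LSTM cell sketch in PyTorch. This is the modern form of the cell (the forget gate was a later addition to the 1997 design), and the fused weight layout is my own choice rather than the paper's notation:

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """One LSTM step: sigmoid gates decide what to forget, what to write, and what to emit."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # A single fused linear map produces all four pre-activations at once.
        self.proj = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, h, c):
        z = self.proj(torch.cat([x, h], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates squashed into (0, 1)
        g = torch.tanh(g)                  # candidate cell update
        c_new = f * c + i * g              # forget gate scales the old state, input gate admits the new
        h_new = o * torch.tanh(c_new)      # output gate controls what the cell exposes
        return h_new, c_new
```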
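
The GRU collapses this to two gates and drops the separate cell state. A corresponding sketch; note that the sign convention for the update gate differs between the original paper and some library implementations:

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """One GRU step: a reset gate and an update gate, with no separate cell state."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 2 * hidden_size)  # reset + update pre-activations
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)       # candidate state

    def forward(self, x, h):
        r, z = torch.sigmoid(self.gates(torch.cat([x, h], dim=-1))).chunk(2, dim=-1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=-1)))  # reset gate masks the old state
        return z * h + (1.0 - z) * h_tilde  # update gate interpolates between old state and candidate
```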
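
For the gated convolutional papers, the core layer multiplies a convolution's output elementwise by a sigmoid gate computed from the same input (a GLU). The sketch below uses "same" padding for simplicity; the language-modeling paper uses causal convolutions:

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated convolution: the conv output is modulated elementwise by a sigmoid gate."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        # One convolution produces both the candidate values and the gate pre-activations.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                    # x: (batch, channels, length)
        values, gate = self.conv(x).chunk(2, dim=1)
        return values * torch.sigmoid(gate)  # each channel/position is scaled into (0, 1) by its gate
```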
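
And a highway layer, where a transform gate interpolates between a nonlinearity and the untouched input. The negative gate-bias initialization follows the paper's recommendation, though the exact value here is my choice:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway layer: a transform gate T(x) blends a nonlinearity H(x) with the input x."""

    def __init__(self, dim: int):
        super().__init__()
        self.h = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)
        # Start the gate bias negative so the layer initially just carries x through,
        # which is what makes very deep stacks trainable from the start.
        nn.init.constant_(self.t.bias, -2.0)

    def forward(self, x):
        t = torch.sigmoid(self.t(x))  # carry/transform gate in (0, 1)
        return t * torch.relu(self.h(x)) + (1.0 - t) * x
```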

"GLU Variants Improve Transformer" by Noam Shazeer (https://arxiv.org/abs/2002.05202, 2020): shows that adding a gate to the first layer of a transformer's FFN tends to improve perplexity on a range of tasks. ReGLU and SwiGLU (using ReLU or SiLU/Swish activations instead of sigmoid for the gates) seemed to work the best. The paper ends with the now-classic quote: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
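
A minimal sketch of such a gated FFN, assuming the SwiGLU variant and bias-free linear layers as in the paper's formulation (the class and dimension names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Transformer feed-forward block with the first layer replaced by a SwiGLU gate."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)   # candidate branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # gate branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # projection back to model width

    def forward(self, x):
        # SwiGLU: silu(xW) elementwise-times (xV); swap F.silu for F.relu to get ReGLU,
        # or torch.sigmoid to recover the original GLU.
        return self.w2(F.silu(self.w(x)) * self.v(x))
```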