Link: A Universal Law of Robustness via Isoperimetry | OpenReview This paper purports to explain (and quantify) the observed fact that…
Modified: March 03, 2022.
In the spirit of [ prediction as a model-building exercise ]. Language modeling: system writes publishable poetry: debatably already…
Modified: April 07, 2022.
See https://emtiyaz.github.io/papers/learning_from_bayes.pdf Suppose we have a learning problem For some choice of exponential-family…
Modified: July 18, 2021.
For any strictly [ convex ] function , define the Bregman divergence: Examples: (Squared) Euclidean distance : choose the squared norm…
Modified: March 07, 2022.
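For concreteness, a minimal numpy sketch of the Bregman divergence named in the note above, with the gradient of f passed explicitly (the squared-Euclidean example is the one the note mentions):

```python
import numpy as np

def bregman_divergence(f, grad_f, x, y):
    # D_f(x, y) = f(x) - f(y) - <grad_f(y), x - y>
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# Example: f(x) = ||x||^2 gives the squared Euclidean distance.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2.0 * x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman_divergence(f, grad_f, x, y))  # ||x - y||^2 = 2.0
```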
Related to [ natural gradient ] and the [ Fisher information ] matrix. Let's say we have a parametric model of some data. The Cramer-Rao…
Modified: July 05, 2022.
Generates images from captions by combining [ CLIP ]-style representation learning with a [ diffusion model ] model to construct images from…
Modified: April 30, 2022.
Weight-Space View Recall standard linear regression. We suppose and where , where can be augmented with an implicit 1 term to allow a…
Modified: March 16, 2022.
References: Dayan, Hinton, Neal, Zemel (1994) https://www.cs.toronto.edu/~hinton/absps/helmholtz.pdf This paper is one of the first to…
Modified: March 24, 2024.
Good tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Modified: October 17, 2022.
We're given a [ constrained optimization ] problem Note that the standard formulation of Lagrange multipliers handles only equality…
Modified: April 29, 2023.
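For reference, a sketch (in my notation, not necessarily the note's) of the generalized Lagrangian that extends the equality-only formulation to inequality constraints via the KKT conditions:

$$
\min_x f(x) \;\; \text{s.t.}\;\; h_i(x) = 0,\; g_j(x) \le 0,
\qquad
\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_i \lambda_i h_i(x) + \sum_j \mu_j g_j(x),\quad \mu_j \ge 0,
$$

with complementary slackness $\mu_j\, g_j(x) = 0$ holding at a KKT point.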
Yann LeCun's famous cake analogy is that: "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake…
Modified: June 12, 2021.
Gilmer et al. 2017 paper. Experiments on QM9. Unlike SMILES strings, includes molecular geometry. General formulation of message passing…
Modified: March 21, 2022.
If two hypotheses are equally consistent with the data, the simpler is more likely to be 'true'. Formally, it is more likely to generalize…
Modified: March 03, 2022.
I'm trying to build my understanding. These are fragments of intuitions. Bayesian inference starts with a prior P and a likelihood. Given…
Modified: January 15, 2021.
In a bandit setting, in each round we see a context , choose an action , and receive a reward sampled from a distribution with some…
Modified: March 26, 2025.
Massive list here: https://github.com/cedrickchee/awesome-bert-nlp Bahdanau, Cho, Bengio. Neural Machine Translation by Jointly Learning to…
Modified: January 24, 2022.
Fundamentally an algorithm is any computational procedure: something that takes in data and spits out some function of that data. Computer…
Modified: November 29, 2022.
One of the best ideas in machine learning. (I even thought so in 2011!) There are two common mechanisms: 'soft' and 'hard'. In both cases…
Modified: January 24, 2022.
This is my stab at explaining automatic differentiation, specifically backprop and applications to neural nets. A few dimensions to think…
Modified: August 23, 2022.
I think of "variance" as the error in a statistical estimate that comes from not having enough data (assuming an [ identifiable ] model…
Modified: September 28, 2023.
A nice paper that gets at some subtleties of calibration: Daniel D. Johnson, Daniel Tarlow, David Duvenaud, Chris J. Maddison. Experts Don't…
Modified: March 26, 2024.
The [ distinction ] between classification and regression is, from one point of view, arbitrary: it's all just function approximation, and…
Modified: March 12, 2021.
(below was originally an email to SuccessfulFriend, copying here for posterity) learning as compression seems like kind of a folk idea with…
Modified: September 06, 2021.
Arguably the core insight of deep learning / [ differentiable program ]ming is that the shape and structure of the computations we do are so…
Modified: September 13, 2022.
Suppose we want to optimize an objective under some equality and/or inequality constraints, Some general classes of approach we can use are…
Modified: July 07, 2022.
Relevant papers: Differentiable compositional kernel learning for Gaussian Processes (Sun et al., 2018) Differentiable Architecture Search…
Modified: March 07, 2020.
A technique for [ representation ] learning in which semantically similar datapoints are encouraged to have similar representations, and…
Modified: April 30, 2022.
A method for fitting an unnormalized probability density (aka [ energy-based model ]) to data. Note that this is a different and harder…
Modified: May 15, 2021.
References: Cooperative Inverse Reinforcement Learning The Off-Switch Game Incorrigibility in the CIRL Framework The CIRL setting models…
Modified: April 05, 2023.
Current (2021) deep networks require huge datasets in order to [ generalization|generalize ]. But we know that humans can do one-shot…
Modified: May 23, 2021.
References: Holtzman et al. (2020), The Curious Case of Neural Text Degeneration https://arxiv.org/abs/1904.09751 How should we actually…
Modified: July 02, 2022.
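A small numpy sketch of the nucleus (top-p) rule defined in the Holtzman et al. paper cited above: sample from the smallest set of tokens whose cumulative probability reaches p, renormalized over that set.

```python
import numpy as np

def nucleus_sample(probs, p, rng=None):
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    renormed = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renormed)

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
token = nucleus_sample(probs, p=0.9)  # only tokens 0, 1, 2 can be drawn
```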
see also: [ differentiable program ]
Modified: February 10, 2022.
Notes from John Schulman's Berkeley course on deep [ reinforcement learning ], Spring 2016. Value vs Policy-based learning Value-based…
Modified: February 22, 2022.
Fast differentiable sorting and ranking: https://arxiv.org/abs/2002.08871 What are differentiable analogues of 'standard' programming…
Modified: March 07, 2020.
Diffusion models for image generation were independently invented at least twice: in a discrete-time variational inference framework…
Modified: August 31, 2022.
References: ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus , Causal Inference 2: Illustrating Interventions via a Toy…
Modified: August 06, 2021.
Empirically, as model capacity increases past the memorization threshold ( ), [ generalization|generalization ] error starts decreasing…
Modified: March 02, 2022.
TODO: flesh out theory, understand ADMM (e.g., https://www.cis.upenn.edu/~cis515/ws-book-IIb.pdf )
Modified: July 06, 2022.
Often we think of ensembles in the context of supervised learning: we have some algorithm that learns X -> y mappings, and by running it…
Modified: February 10, 2022.
Considering training an [ autoregressive ] model of sequence data (text, audio, action sequences in [ reinforcement learning ], etc.), which…
Modified: October 13, 2022.
This note is a scratchpad for investigating the expressivity of the [ transformer ] architecture. In general, one set of intuitions that we…
Modified: October 28, 2022.
Exponential Families, Conjugacy, Convexity, and Variational Inference Any parameterized family of probability densities that can be written…
Modified: May 21, 2022.
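Written out, the standard form the sentence above is presumably heading toward (notation mine): any family expressible as

$$
p(x \mid \theta) = h(x)\,\exp\!\big(\langle \theta, T(x)\rangle - A(\theta)\big),
\qquad
A(\theta) = \log \int h(x)\, e^{\langle \theta, T(x)\rangle}\, dx,
$$

where $T(x)$ are the sufficient statistics and the log-partition function $A(\theta)$ is convex in $\theta$.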
On an evolutionary timescale, it's useful to evolve structures that can learn quickly. The nervous system is an evolved organ system for…
Modified: October 27, 2022.
As AGW points out here , it is statistically better to fit a flexible model family, with an inductive bias, than a constrained model family…
Modified: February 15, 2022.
'[ no free lunch theorem ]' arguments are misleading because they consider the space of all possible functions. In fact, we usually care…
Modified: December 14, 2022.
References: Dauphin et al. 2017 https://arxiv.org/abs/1612.08083 Shazeer 2020 https://arxiv.org/abs/2002.05202 https://arxiv.org/abs/240…
Modified: March 03, 2024.
Fundamentally, where does generalization come from? [ causality ]: a model may generalize because it has discovered the true mechanism, or…
Modified: October 02, 2021.
Many objects can be generated by a sequence of actions. For example: Generating language by adding one word at a time Generating a molecule…
Modified: March 13, 2022.
Sutton and Barto use this as a general term for any form of interleaving policy evaluation steps with policy improvement steps. This…
Modified: March 22, 2022.
"What I cannot create, I do not understand". Related to: [ computational complexity ]: provers vs verifiers. [ P != NP ] [ production vs…
Modified: April 27, 2020.
Why do we clip gradients in deep learning? When is it important and what is the right way to do it? It seems like the standard recipe used…
Modified: April 29, 2023.
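A sketch of the clip-by-global-norm recipe that is one standard answer to the question above (a common default, not necessarily the note's conclusion):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together if their combined L2 norm exceeds max_norm.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]    # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)  # rescaled to norm ~1
```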
For a normalized distribution , constructed from an (unnormalized) energy with normalizing constant as a function of parameters , in…
Modified: July 09, 2022.
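One standard identity in the setting described above (an energy $E_\theta$ with normalizing constant $Z(\theta)$), stated here as a plausible guess at where the truncated note is going:

$$
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)},
\qquad
\nabla_\theta \log Z(\theta) = -\,\mathbb{E}_{x \sim p_\theta}\big[\nabla_\theta E_\theta(x)\big],
$$

which is what makes maximum-likelihood gradients of energy-based models require samples from the model itself.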
A 'graph neural net' is a differentiable, parameterized function whose input or output (or both) is a graph. Discriminative: graph as input…
Modified: June 06, 2020.
A nice observation from Percy Liang on the relationship between language modeling and grounded understanding: Just because you don't…
Modified: April 29, 2022.
Closely related to [ discrete latent variable ]s and to [ reinforcement learning ] with discrete actions. If I do a thing and it goes well…
Modified: January 23, 2022.
[ value alignment ] research often frames the problem as: first, learn the human 'value function' --- for every possible state of the world…
Modified: June 17, 2024.
Multiple senses: An 'ill-conditioned matrix' has a large ratio between its largest and smallest eigenvalues (more generally, see what is a…
Modified: December 28, 2022.
[ grokking ] / [ phase change hypothesis ] emergence of near-discrete features in large transformers symmetries / non-[ identifiable…
Modified: September 14, 2022.
Ways to specify inductive bias: Feature engineering Prior distribution acts as regularizer in MAP estimates Graphical model (constraint on…
Modified: January 16, 2022.
multiple senses: in machine learning: positive definite (Mercer) kernels in linear algebra: kernel (nullspace) of a linear map in CS systems…
Modified: February 25, 2022.
If you believe that neural nets basically just memorize the training data, then training larger and larger models is hopeless. The…
Modified: September 06, 2021.
In principle we can apply [ automatic differentiation ] through any composition of differentiable operations. This lets us get gradients of…
Modified: July 21, 2022.
To train a [ transformer ] layer on a sequence of length requires the output of the attention computation where are matrices and is…
Modified: February 19, 2024.
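A plain-numpy sketch of the attention computation the note refers to, softmax(Q Kᵀ / √d) V; the n × n score matrix is the source of the quadratic cost in sequence length:

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d) outputs

n, d = 5, 8
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)  # shape (n, d)
```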
Generally this means training some aspect of the learning procedure itself. There is then an inner-loop learning procedure, which follows…
Modified: October 04, 2021.
Unlike most modern [ deep learning ] systems, humans: don't have separate training/test phases (though we may have wake/[ sleep ]) don't…
Modified: January 16, 2022.
Short descriptions of things, when they exist, must capture some kind of structure. The principle of [ Occam's razor ] posits that we should…
Modified: April 12, 2022.
Mirror descent is a framework for optimization algorithms: many algorithms can be framed as mirror descent, and proofs about mirror descent…
Modified: October 03, 2020.
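The update that the framing above refers to, in standard form (with $D_\varphi$ the Bregman divergence of a mirror map $\varphi$):

$$
x_{t+1} = \arg\min_x \Big\{ \eta\,\langle \nabla f(x_t),\, x\rangle + D_\varphi(x, x_t) \Big\},
$$

so that $\varphi(x) = \tfrac12\|x\|^2$ recovers plain gradient descent, while the negative entropy on the simplex gives exponentiated-gradient / multiplicative-weights updates.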
I have a [ strong opinion weakly held ] that doesn't seem to be widely shared in the [ approximate Bayesian inference ] community: reverse…
Modified: March 14, 2022.
In any human-to-human interaction, language carries some very important high-order bits, but it can only carry a few bits. It can help…
Modified: June 12, 2021.
From a conversation I had about [ attention ] mechanisms in deep architectures. Maybe that terminology is too suggestive --- it's just a…
Modified: March 03, 2024.
We don't typically think of it this way, but you can derive a [ gradient descent ] step as finding the point that minimizes a linearized…
Modified: July 06, 2022.
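Spelled out, the derivation the note above sketches: minimize the linearization of $f$ at $x_t$ plus a proximity penalty,

$$
x_{t+1} = \arg\min_x \Big\{ f(x_t) + \langle \nabla f(x_t),\, x - x_t\rangle + \tfrac{1}{2\eta}\|x - x_t\|^2 \Big\}
        = x_t - \eta\, \nabla f(x_t).
$$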
Christian Naesseth, Fredrik Lindsten, Thomas Schon (2015): http://proceedings.mlr.press/v37/naesseth15.html The main idea: In an SMC…
Modified: July 14, 2021.
Like the proverbial half-full glass, smart people can look at the same reality of the current capacities of neural nets, and come to…
Modified: April 07, 2020.
The folklore no-free-lunch 'theorem' in machine learning says that, for any pair of learning algorithms, there exists some dataset on which…
Modified: March 04, 2022.
https://arxiv.org/abs/1712.02390 Basic idea: optimizers like Adam and RMSProp already keep track of posterior curvature estimates. These are…
Modified: October 30, 2020.
reading the perceiver papers from Deepmind: Perceiver: Jaegle et al 2021 https://arxiv.org/abs/2103.03206 Perceiver-IO: Jaegle et al 202…
Modified: September 25, 2023.
(see also: [ large models ]) There's a viewpoint that neural nets just memorize the training data, so the more training data you have, the…
Modified: February 10, 2022.
It seems like there is, or can be, a virtuous relationship between privacy and generalization. You don't want to memorize too many…
Modified: February 14, 2021.
Can we think about [ generative flow network ]s as a potentially tractable formulation of probabilistic program induction?! executing a line…
Modified: March 14, 2022.
Many [ probabilistic programming ] researchers frame their work as part of the broader problem of [ artificial intelligence ]. Artificial…
Modified: December 01, 2023.
A short note on interpreting a transformer layer as performing maximum-likelihood inference in a Gaussian mixture model: https://arxiv.org…
Modified: October 30, 2020.
Introduced by Geoff Hinton (1999): Products of Experts . Each expert produces a probability distribution. These are combined by…
Modified: May 15, 2021.
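The combination rule, for reference (each expert $p_k$ multiplied together and renormalized):

$$
p(x) = \frac{1}{Z}\prod_k p_k(x),
\qquad
Z = \int \prod_k p_k(x)\, dx,
$$

so any single expert can veto a configuration by assigning it near-zero probability.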
It's tempting to use [ natural gradient ] ascent to optimize a variational distribution. We could also consider using it to optimize the…
Modified: October 25, 2020.
Note : see [ reinforcement learning notation ] for a guide to the notation I'm attempting to use through my RL notes. Three paradigmatic…
Modified: April 23, 2022.
Suppose we want a [ transformer ] to evaluate the inequality returning if and otherwise. For integer , this can be done with a…
Modified: February 13, 2023.
The selection operation y = where(c, a, b) returns How can a [ transformer ] layer implement this operation? One approach is to use…
Modified: February 12, 2023.
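As an elementary reference point (not necessarily the construction the truncated note goes on to describe), the selection operation satisfies the arithmetic identity where(c, a, b) = c·a + (1 − c)·b for c ∈ {0, 1}:

```python
import numpy as np

def where(c, a, b):
    # c is assumed to be 0/1 valued; c * a + (1 - c) * b picks a where c == 1, else b.
    return c * a + (1 - c) * b

c = np.array([1.0, 0.0, 1.0])
a = np.array([10.0, 20.0, 30.0])
b = np.array([-1.0, -2.0, -3.0])
print(where(c, a, b))  # [10. -2. 30.]
```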
If a model with data has normalizing constant , then the replica trick says that This allows us to analyze the average log-normalizer…
Modified: October 22, 2022.
In modern ML, representation learning is the art of trying to find useful abstractions, embodied as encoding networks. We can learn…
Modified: February 11, 2022.
Scheduled sampling is a training procedure for sequence models that attempts to mitigate [ exposure bias ] - the problem in which generation…
Modified: October 13, 2022.
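A minimal sketch of the mixing step described above: when assembling the model's inputs during training, each previous ground-truth token is kept with some probability and otherwise replaced by the model's own prediction (here `model_predictions` and the annealing of `p_ground_truth` are placeholders for the real training loop):

```python
import random

def mix_inputs(gold_tokens, model_predictions, p_ground_truth):
    # With probability p_ground_truth keep the gold token (teacher forcing),
    # otherwise feed the model its own prediction for that position.
    return [gold if random.random() < p_ground_truth else pred
            for gold, pred in zip(gold_tokens, model_predictions)]

# p_ground_truth typically starts near 1.0 and is annealed toward 0 over training.
```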
References: https://redwood.berkeley.edu/wp-content/uploads/2020/08/KanervaP_SDMrelated_models1993.pdf A sparse distributed memory consists…
Modified: March 29, 2024.
References: Jacobs, Jordan, Nowlan, Hinton. Adaptive Mixtures of Local Experts (1991) Shazeer et al. Outrageously Large Neural Networks…
Modified: February 13, 2023.
In kindergarten stats, you learn how to build a model that takes in data (a feature vector, image, sound file, etc) and predicts a single…
Modified: March 03, 2022.
A -dimensional vector can represent distinct orthogonal features, but due to the weirdness of [ high-dimension ]al geometry, it can…
Modified: September 14, 2022.
Something that confused me for a while is that people in certain communities talk about 'teacher forcing' as though it's a trick or a…
Modified: October 13, 2022.
Rob wants to firm up his foundations. He wants to understand relevant stats, probabilistic models, inference, and maybe work our way up to…
Modified: January 25, 2022.
Everyone in machine learning talks about tensors, but no one really understands what they are. This page collects several definitions and…
Modified: July 18, 2022.
How should a machine learning model represent text? Word-level and character-level features are obvious options, but both have drawbacks…
Modified: February 13, 2023.
These days we think a lot about using data to train large [ language model ]s. But there's only so much data in the world; eventually we'll…
Modified: October 27, 2022.
In developing intuition about [ transformer ]s it's useful to think about specific primitive operations that can be implemented by a small…
Modified: February 13, 2023.
The core of the transformer architecture is multi-headed [ attention ]. The transformer block consists of a multi-headed attention layer…
Modified: February 13, 2023.
References: Jacob Eisner, High-Level Explanation of Variational Inference (2011) https://www.cs.jhu.edu/~jason/tutorials/variational.html…
Modified: April 26, 2022.
Holy shit. In December on Galiano I was brainstorming about [ continuous structure learning ] and thought of the general trick, for…
Modified: June 09, 2020.
Note: these are personal notes, taken as I was refreshing myself on this material. They're mostly stream of consciousness and probably not…
Modified: March 16, 2022.