Link: A Universal Law of Robustness via Isoperimetry | OpenReview This paper purports to explain (and quantify) the observed fact that…

In the spirit of [ prediction as a model-building exercise ]. Language modeling: system writes publishable poetry: debatably already…

See https://emtiyaz.github.io/papers/learning_from_bayes.pdf Suppose we have a learning problem For some choice of exponential-family…

For any strictly [ convex ] function , define the Bregman divergence: Examples: (Squared) Euclidean distance : choose the squared norm…

Related to [ natural gradient ] and the [ Fisher information ] matrix. Let's say we have a parametric model of some data. The Cramer-Rao…

Generates images from captions by combining [ CLIP ]-style representation learning with a [ diffusion model ] to construct images from…

Weight-Space View Recall standard linear regression. We suppose and where , where can be augmented with an implicit 1 term to allow a…

Good tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Yann LeCun's famous cake analogy is that: "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake…

We're given a [ constrained optimization ] problem Note that the standard formulation of Lagrange multipliers handles only equality…

Gilmer et al. (2017) paper. Experiments on QM9. Unlike SMILES strings, includes molecular geometry. General formulation of message passing…

If two hypotheses are equally consistent with the data, the simpler is more likely to be 'true'. Formally, it is more likely to generalize…

I'm trying to build my understanding. These are fragments of intuitions. Bayesian inference starts with a prior P and a likelihood. Given…

Massive list here: https://github.com/cedrickchee/awesome-bert-nlp Bahdanau, Cho, Bengio. Neural Machine Translation by Jointly Learning to…

Fundamentally an algorithm is any computational procedure: something that takes in data and spits out some function of that data. Computer…

This is my stab at explaining automatic differentiation, specifically backprop and applications to neural nets. A few dimensions to think…

One of the best ideas in machine learning. (I even thought so in 2011!) There are two common mechanisms: 'soft' and 'hard'. In both cases…

I think of "variance" as the error in a statistical estimate that comes from not having enough data (assuming an [ identifiable ] model…

The [ distinction ] between classification and regression is, from one point of view, arbitrary: it's all just function approximation, and…

Arguably the core insight of deep learning / [ differentiable program ]ming is that the shape and structure of the computations we do are so…

Suppose we want to optimize an objective under some equality and/or inequality constraints, Some general classes of approach we can use are…

Relevant papers: Differentiable Compositional Kernel Learning for Gaussian Processes (Sun et al., 2018) Differentiable Architecture Search…

A technique for [ representation ] learning in which semantically similar datapoints are encouraged to have similar representations, and…

A method for fitting an unnormalized probability density (aka [ energy-based model ]) to data. Note that this is a different and harder…

References: Cooperative Inverse Reinforcement Learning The Off-Switch Game Incorrigibility in the CIRL Framework The CIRL setting models…

Current (2021) deep networks require huge datasets in order to [ generalize ]. But we know that humans can do one-shot…

References: Holtzman et al. (2020), The Curious Case of Neural Text Degeneration https://arxiv.org/abs/1904.09751 How should we actually…

Notes from John Schulman's Berkeley course on deep [ reinforcement learning ], Spring 2016. Value vs Policy-based learning Value-based…

see also: [ differentiable program ]

Fast differentiable sorting and ranking: https://arxiv.org/abs/2002.08871 What are differentiable analogues of 'standard' programming…

Diffusion models for image generation were independently invented at least twice: in a discrete-time variational inference framework…

References: ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus, Causal Inference 2: Illustrating Interventions via a Toy…

Empirically, as model capacity increases past the memorization threshold ( ), [ generalization ] error starts decreasing…

TODO: flesh out theory, understand ADMM (e.g., https://www.cis.upenn.edu/~cis515/ws-book-IIb.pdf )

Often we think of ensembles in the context of supervised learning: we have some algorithm that learns X -> y mappings, and by running it…

Consider training an [ autoregressive ] model of sequence data (text, audio, action sequences in [ reinforcement learning ], etc.), which…

This note is a scratchpad for investigating the expressivity of the [ transformer ] architecture. In general, one set of intuitions that we…

Exponential Families, Conjugacy, Convexity, and Variational Inference Any parameterized family of probability densities that can be written…

On an evolutionary timescale, it's useful to evolve structures that can learn quickly. The nervous system is an evolved organ system for…

As AGW points out here, it is statistically better to fit a flexible model family, with an inductive bias, than a constrained model family…

'[ no free lunch theorem ]' arguments are misleading because they consider the space of all possible functions. In fact, we usually care…

Fundamentally, where does generalization come from? [ causality ]: a model may generalize because it has discovered the true mechanism, or…

Sutton and Barto use this as a general term for any form of interleaving policy evaluation steps with policy improvement steps. This…

Many objects can be generated by a sequence of actions. For example: Generating language by adding one word at a time Generating a molecule…

"What I cannot create, I do not understand". Related to: [ computational complexity ]: provers vs verifiers. [ P != NP ] [ production vs…

Why do we clip gradients in deep learning? When is it important and what is the right way to do it? It seems like the standard recipe used…

For a normalized distribution , constructed from an (unnormalized) energy with normalizing constant as a function of parameters , in…

A 'graph neural net' is a differentiable, parameterized function whose input or output (or both) is a graph. Discriminative: graph as input…

A nice observation from Percy Liang on the relationship between language modeling and grounded understanding: Just because you don't…

Closely related to [ discrete latent variable ]s and to [ reinforcement learning ] with discrete actions. If I do a thing and it goes well…

Multiple senses: An 'ill-conditioned matrix' has a large ratio between its largest and smallest eigenvalue (more generally, see what is a…

[ grokking ] / [ phase change hypothesis ] emergence of near-discrete features in large transformers symmetries / non-[ identifiable…

Ways to specify inductive bias: Feature engineering Prior distribution acts as regularizer in MAP estimates Graphical model (constraint on…

multiple senses: in machine learning: positive definite (Mercer) kernels in linear algebra: kernel (nullspace) of a linear map in CS systems…

If you believe that neural nets basically just memorize the training data, then training larger and larger models is hopeless. The…

Unlike most modern [ deep learning ] systems, humans: don't have separate training/test phases (though we may have wake/[ sleep ]) don't…

Generally this means training some aspect of the learning procedure itself. There is then an inner-loop learning procedure, which follows…

Short descriptions of things, when they exist, must capture some kind of structure. The principle of [ Occam's razor ] posits that we should…

Mirror descent is a framework for optimization algorithms: many algorithms can be framed as mirror descent, and proofs about mirror descent…

I have a [ strong opinion weakly held ] that doesn't seem to be widely shared in the [ approximate Bayesian inference ] community: reverse…

In any human-to-human interaction, language carries some very important high-order bits, but it can only carry a few bits. It can help…

From a conversation I had about [ attention ] mechanisms in deep architectures. Maybe that terminology is too suggestive --- it's just a…

We don't typically think of it this way, but you can derive a [ gradient descent ] step as finding the point that minimizes a linearized…

Christian Naesseth, Fredrik Lindsten, Thomas Schon (2015): http://proceedings.mlr.press/v37/naesseth15.html The main idea: In an SMC…

Like the proverbial half-full glass, smart people can look at the same reality of the current capacities of neural nets, and come to…

The folklore no-free-lunch 'theorem' in machine learning says that, for any pair of learning algorithms, there exists some dataset on which…

https://arxiv.org/abs/1712.02390 Basic idea: optimizers like Adam and RMSProp already keep track of posterior curvature estimates. These are…

Reading the Perceiver papers from DeepMind: Perceiver: Jaegle et al. 2021 https://arxiv.org/abs/2103.03206 Perceiver-IO: Jaegle et al 202…

(see also: [ large models ]) There's a viewpoint that neural nets just memorize the training data, so the more training data you have, the…

It seems like there is, or can be, a virtuous relationship between privacy and generalization. You don't want to memorize too many…

Can we think about [ generative flow network ]s as a potentially tractable formulation of probabilistic program induction?! executing a line…

Many [ probabilistic programming ] researchers frame their work as part of the broader problem of [ artificial intelligence ]. Artificial…

A short note on interpreting a transformer layer as performing maximum-likelihood inference in a Gaussian mixture model: https://arxiv.org…

Introduced by Geoff Hinton (1999): Products of Experts. Each expert produces a probability distribution. These are combined by…

It's tempting to use [ natural gradient ] ascent to optimize a variational distribution. We could also consider using it to optimize the…

Note: see [ reinforcement learning notation ] for a guide to the notation I'm attempting to use through my RL notes. Three paradigmatic…

The selection operation y = where(c, a, b) returns How can a [ transformer ] layer implement this operation? One approach is to use…

Suppose we want a [ transformer ] to evaluate the inequality returning if and otherwise. For integer , this can be done with a…

If a model with data has normalizing constant , then the replica trick says that This allows us to analyze the average log-normalizer…

In modern ML, representation learning is the art of trying to find useful abstractions, embodied as encoding networks. We can learn…

Scheduled sampling is a training procedure for sequence models that attempts to mitigate [ exposure bias ] - the problem in which generation…

References: Jacobs, Jordan, Nowlan, Hinton. Adaptive Mixtures of Local Experts (1991) Shazeer et al. Outrageously Large Neural Networks…

In kindergarten stats, you learn how to build a model that takes in data (a feature vector, image, sound file, etc) and predicts a single…

A -dimensional vector can represent distinct orthogonal features, but due to the weirdness of [ high-dimension ]al geometry, it can…

Something that confused me for a while is that people in certain communities talk about 'teacher forcing' as though it's a trick or a…

Rob wants to firm up his foundations. He wants to understand relevant stats, probabilistic models, inference, and maybe work our way up to…

Everyone in machine learning talks about tensors, but no one really understands what they are. This page collects several definitions and…

How should a machine learning model represent text? Word-level and character-level features are obvious options, but both have drawbacks…

These days we think a lot about using data to train large [ language model ]s. But there's only so much data in the world; eventually we'll…

In developing intuition about [ transformer ]s it's useful to think about specific primitive operations that can be implemented by a small…

The core of the transformer architecture is multi-headed [ attention ]. The transformer block consists of a multi-headed attention layer…

How should people do VI? One ultimate goal is that you write a Stan model (or better, a model with discrete variables, but one step at a…

Holy shit. In December on Galiano I was brainstorming about [ continuous structure learning ] and thought of the general trick, for…

Note: these are personal notes, taken as I was refreshing myself on this material. They're mostly stream of consciousness and probably not…