Link: A Universal Law of Robustness via Isoperimetry | OpenReview This paper purports to explain (and quantify) the observed fact that…
In the spirit of [ prediction as a model-building exercise ]. Language modeling: system writes publishable poetry: debatably already…
See https://emtiyaz.github.io/papers/learning_from_bayes.pdf Suppose we have a learning problem For some choice of exponential-family…
For any strictly [ convex ] function $\phi$, define the Bregman divergence $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle$. Examples: (Squared) Euclidean distance: choose the squared norm…
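A tiny numerical check of the definition above (the helper names and test vectors are mine): with $\phi(x) = \|x\|^2$, the Bregman divergence reduces to the squared Euclidean distance, as claimed.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Squared Euclidean distance: phi(x) = ||x||^2 gives D(x, y) = ||x - y||^2.
phi = lambda v: np.dot(v, v)
grad_phi = lambda v: 2.0 * v
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman(phi, grad_phi, x, y), np.sum((x - y) ** 2))
```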
Related to [ natural gradient ] and the [ Fisher information ] matrix. Let's say we have a parametric model of some data. The Cramer-Rao…
Generates images from captions by combining [ CLIP ]-style representation learning with a [ diffusion model ] to construct images from…
Weight-Space View Recall standard linear regression. We suppose $f(x) = w^\top x$ and $y = f(x) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $x$ can be augmented with an implicit 1 term to allow a…
Good tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Yann LeCun's famous cake analogy: "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake…
We're given a [ constrained optimization ] problem. Note that the standard formulation of Lagrange multipliers handles only equality…
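For reference, a sketch of the equality-constrained formulation being referred to (notation is my choice): the Lagrangian adds one multiplier per equality constraint, and stationarity in both $x$ and $\lambda$ recovers the optimality and feasibility conditions.

```latex
% Equality-constrained problem and its Lagrangian (standard formulation; notation mine):
\min_x f(x) \quad \text{s.t.} \quad h_i(x) = 0,\ i = 1, \dots, m,
\qquad
\mathcal{L}(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i\, h_i(x),
\qquad
\nabla_x \mathcal{L} = 0, \;\; \nabla_\lambda \mathcal{L} = 0 .
```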
Gilmer et al. 2017 paper. Experiments on QM9. Unlike SMILES strings, includes molecular geometry. General formulation of message passing…
If two hypotheses are equally consistent with the data, the simpler is more likely to be 'true'. Formally, it is more likely to generalize…
I'm trying to build my understanding. These are fragments of intuitions. Bayesian inference starts with a prior P and a likelihood. Given…
Massive list here: https://github.com/cedrickchee/awesome-bert-nlp Bahdanau, Cho, Bengio. Neural Machine Translation by Jointly Learning to…
Fundamentally an algorithm is any computational procedure: something that takes in data and spits out some function of that data. Computer…
This is my stab at explaining automatic differentiation, specifically backprop and applications to neural nets. A few dimensions to think…
One of the best ideas in machine learning. (I even thought so in 2011!) There are two common mechanisms: 'soft' and 'hard'. In both cases…
I think of "variance" as the error in a statistical estimate that comes from not having enough data (assuming an [ identifiable ] model…
The [ distinction ] between classification and regression is, from one point of view, arbitrary: it's all just function approximation, and…
Arguably the core insight of deep learning / [ differentiable program ]ming is that the shape and structure of the computations we do are so…
Suppose we want to optimize an objective under some equality and/or inequality constraints. Some general classes of approach we can use are…
Relevant papers: Differentiable compositional kernel learning for Gaussian Processes (Sun et al., 2018); Differentiable Architecture Search…
A technique for [ representation ] learning in which semantically similar datapoints are encouraged to have similar representations, and…
A method for fitting an unnormalized probability density (aka [ energy-based model ]) to data. Note that this is a different and harder…
References: Cooperative Inverse Reinforcement Learning; The Off-Switch Game; Incorrigibility in the CIRL Framework. The CIRL setting models…
Current (2021) deep networks require huge datasets in order to [ generalization|generalize ]. But we know that humans can do one-shot…
References: Holtzman et al. (2020), The Curious Case of Neural Text Degeneration https://arxiv.org/abs/1904.09751 How should we actually…
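A rough numpy sketch of the nucleus (top-p) sampling rule proposed in the Holtzman et al. paper; the function name, the threshold p=0.9, and the RNG seed are illustrative choices, not from the note.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=np.random.default_rng(0)):
    """Sample from the smallest set of tokens whose total probability exceeds p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                      # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1     # smallest prefix with mass >= p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()          # renormalize within the nucleus
    return rng.choice(keep, p=kept)

print(nucleus_sample(np.array([2.0, 1.0, 0.1, -3.0])))
```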
Notes from John Schulman's Berkeley course on deep [ reinforcement learning ], Spring 2016. Value vs. Policy-based learning. Value-based…
see also: [ differentiable program ]
Fast differentiable sorting and ranking: https://arxiv.org/abs/2002.08871 What are differentiable analogues of 'standard' programming…
Diffusion models for image generation were independently invented at least twice: in a discrete-time variational inference framework…
References: ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus; Causal Inference 2: Illustrating Interventions via a Toy…
Empirically, as model capacity increases past the memorization threshold, [ generalization|generalization ] error starts decreasing…
TODO: flesh out theory, understand ADMM (e.g., https://www.cis.upenn.edu/~cis515/ws-book-IIb.pdf )
Often we think of ensembles in the context of supervised learning: we have some algorithm that learns X -> y mappings, and by running it…
Consider training an [ autoregressive ] model of sequence data (text, audio, action sequences in [ reinforcement learning ], etc.), which…
This note is a scratchpad for investigating the expressivity of the [ transformer ] architecture. In general, one set of intuitions that we…
Exponential Families, Conjugacy, Convexity, and Variational Inference. Any parameterized family of probability densities that can be written…
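The standard form being referred to, written out for reference in the usual notation (sufficient statistic $T$, natural parameter $\theta$, base measure $h$, log-partition $A$):

```latex
% Exponential-family density:
p(x \mid \theta) = h(x)\, \exp\!\big( \theta^\top T(x) - A(\theta) \big),
\qquad
A(\theta) = \log \int h(x)\, e^{\theta^\top T(x)}\, dx .
```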
On an evolutionary timescale, it's useful to evolve structures that can learn quickly. The nervous system is an evolved organ system for…
As AGW points out here, it is statistically better to fit a flexible model family, with an inductive bias, than a constrained model family…
'[ no free lunch theorem ]' arguments are misleading because they consider the space of all possible functions. In fact, we usually care…
Fundamentally, where does generalization come from? [ causality ]: a model may generalize because it has discovered the true mechanism, or…
Sutton and Barto use this as a general term for any form of interleaving policy evaluation steps with policy improvement steps. This…
Many objects can be generated by a sequence of actions. For example: generating language by adding one word at a time; generating a molecule…
"What I cannot create, I do not understand". Related to: [ computational complexity ]: provers vs verifiers. [ P != NP ] [ production vs…
Why do we clip gradients in deep learning? When is it important and what is the right way to do it? It seems like the standard recipe used…
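A minimal sketch of that standard recipe as it usually appears in PyTorch code: clip the global gradient norm between the backward pass and the optimizer step. The toy model, data, and max_norm=1.0 are placeholders.

```python
import torch

# Build a throwaway model and batch just to show where the clipping call goes.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescales all grads jointly
optimizer.step()
optimizer.zero_grad()
```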
For a normalized distribution $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$, constructed from an (unnormalized) energy $E_\theta(x)$ with normalizing constant $Z(\theta)$ as a function of parameters $\theta$, in…
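A standard consequence of this setup, stated here for reference (not necessarily the direction the note takes): the gradient of the log-normalizer is the negative expected gradient of the energy under the model.

```latex
% Gradient of the log-normalizer for p_\theta(x) = e^{-E_\theta(x)} / Z(\theta):
\nabla_\theta \log Z(\theta)
  = \frac{1}{Z(\theta)} \int \nabla_\theta\, e^{-E_\theta(x)}\, dx
  = -\,\mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta E_\theta(x) \right].
```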
A 'graph neural net' is a differentiable, parameterized function whose input or output (or both) is a graph. Discriminative: graph as input…
A nice observation from Percy Liang on the relationship between language modeling and grounded understanding: Just because you don't…
Closely related to [ discrete latent variable ]s and to [ reinforcement learning ] with discrete actions. If I do a thing and it goes well…
Multiple senses: An 'ill-conditioned matrix' has a large ratio between its largest and smallest eigenvalues (more generally, see what is a…
[ grokking ] / [ phase change hypothesis ]; emergence of near-discrete features in large transformers; symmetries / non-[ identifiable…
Ways to specify inductive bias: Feature engineering; Prior distribution acts as regularizer in MAP estimates; Graphical model (constraint on…
multiple senses: in machine learning, positive definite (Mercer) kernels; in linear algebra, the kernel (nullspace) of a linear map; in CS systems…
If you believe that neural nets basically just memorize the training data, then training larger and larger models is hopeless. The…
Unlike most modern [ deep learning ] systems, humans: don't have separate training/test phases (though we may have wake/[ sleep ]); don't…
Generally this means training some aspect of the learning procedure itself. There is then an inner-loop learning procedure, which follows…
Short descriptions of things, when they exist, must capture some kind of structure. The principle of [ Occam's razor ] posits that we should…
Mirror descent is a framework for optimization algorithms: many algorithms can be framed as mirror descent, and proofs about mirror descent…
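A compact statement of the framework, as a sketch (mirror map $\psi$ and its Bregman divergence $D_\psi$; notation mine): each step minimizes the linearized objective plus a Bregman proximity term, recovering ordinary gradient descent when $\psi$ is half the squared norm.

```latex
% Mirror descent step with mirror map \psi and step size \eta:
x_{t+1} = \arg\min_{x} \; \langle \nabla f(x_t),\, x \rangle + \tfrac{1}{\eta}\, D_\psi(x, x_t),
\qquad
D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle .
```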
I have a [ strong opinion weakly held ] that doesn't seem to be widely shared in the [ approximate Bayesian inference ] community: reverse…
In any human-to-human interaction, language carries some very important high-order bits, but it can only carry a few bits. It can help…
From a conversation I had about [ attention ] mechanisms in deep architectures. Maybe that terminology is too suggestive --- it's just a…
We don't typically think of it this way, but you can derive a [ gradient descent ] step as finding the point that minimizes a linearized…
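The derivation sketched above, written out under the standard assumptions (smooth objective $f$, step size $\eta$): linearize $f$ around $x_t$, add a squared-distance penalty, and set the gradient to zero.

```latex
% Gradient descent as minimizing a linearized objective plus a proximity term:
x_{t+1} = \arg\min_{x} \; f(x_t) + \langle \nabla f(x_t),\, x - x_t \rangle + \tfrac{1}{2\eta} \|x - x_t\|^2
\;\;\Longrightarrow\;\;
x_{t+1} = x_t - \eta\, \nabla f(x_t) .
```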
Christian Naesseth, Fredrik Lindsten, Thomas Schon (2015): http://proceedings.mlr.press/v37/naesseth15.html The main idea: In an SMC…
Like the proverbial half-full glass, smart people can look at the same reality of the current capacities of neural nets, and come to…
The folklore no-free-lunch 'theorem' in machine learning says that, for any pair of learning algorithms, there exists some dataset on which…
https://arxiv.org/abs/1712.02390 Basic idea: optimizers like Adam and RMSProp already keep track of posterior curvature estimates. These are…
Reading the Perceiver papers from DeepMind: Perceiver: Jaegle et al. 2021 https://arxiv.org/abs/2103.03206 Perceiver-IO: Jaegle et al. 202…
(see also: [ large models ]) There's a viewpoint that neural nets just memorize the training data, so the more training data you have, the…
It seems like there is, or can be, a virtuous relationship between privacy and generalization. You don't want to memorize too many…
Can we think about [ generative flow network ]s as a potentially tractable formulation of probabilistic program induction?! Executing a line…
Many [ probabilistic programming ] researchers frame their work as part of the broader problem of [ artificial intelligence ]. Artificial…
A short note on interpreting a transformer layer as performing maximum-likelihood inference in a Gaussian mixture model: https://arxiv.org…
Introduced by Geoff Hinton (1999): Products of Experts. Each expert produces a probability distribution. These are combined by…
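For reference, the combination rule described above, written out (product over $K$ experts; notation is mine): multiply the expert densities pointwise and renormalize.

```latex
% Product of K experts p_1, ..., p_K:
p(x) = \frac{\prod_{k=1}^{K} p_k(x)}{\displaystyle\int \prod_{k=1}^{K} p_k(x')\, dx'} .
```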
It's tempting to use [ natural gradient ] ascent to optimize a variational distribution. We could also consider using it to optimize the…
Note: see [ reinforcement learning notation ] for a guide to the notation I'm attempting to use throughout my RL notes. Three paradigmatic…
The selection operation y = where(c, a, b) returns a when the condition c holds and b otherwise. How can a [ transformer ] layer implement this operation? One approach is to use…
Suppose we want a [ transformer ] to evaluate the inequality, returning 1 if it holds and 0 otherwise. For integer inputs, this can be done with a…
If a model with data $x$ has normalizing constant $Z$, then the replica trick says that $\mathbb{E}[\log Z] = \lim_{n \to 0} \frac{\mathbb{E}[Z^n] - 1}{n}$. This allows us to analyze the average log-normalizer…
In modern ML, representation learning is the art of trying to find useful abstractions, embodied as encoding networks. We can learn…
Scheduled sampling is a training procedure for sequence models that attempts to mitigate [ exposure bias ] - the problem in which generation…
References: Jacobs, Jordan, Nowlan, Hinton. Adaptive Mixtures of Local Experts (1991); Shazeer et al. Outrageously Large Neural Networks…
In kindergarten stats, you learn how to build a model that takes in data (a feature vector, image, sound file, etc) and predicts a single…
A $d$-dimensional vector can represent $d$ distinct orthogonal features, but due to the weirdness of [ high-dimension ]al geometry, it can…
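A quick numerical illustration of the geometric point (the dimension d=1000 and the number of vectors are arbitrary choices of mine): random unit vectors in high dimensions are nearly orthogonal, with pairwise cosines on the order of $1/\sqrt{d}$.

```python
import numpy as np

# Draw random unit vectors and look at how close to orthogonal they are pairwise.
rng = np.random.default_rng(0)
d, n = 1000, 200
vecs = rng.standard_normal((n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
dots = vecs @ vecs.T
off_diag = dots[~np.eye(n, dtype=bool)]
print(f"max |cosine| among {n} random vectors in d={d}: {np.abs(off_diag).max():.3f}")
```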
Something that confused me for a while is that people in certain communities talk about 'teacher forcing' as though it's a trick or a…
Rob wants to firm up his foundations. He wants to understand relevant stats, probabilistic models, inference, and maybe work our way up to…
Everyone in machine learning talks about tensors, but no one really understands what they are. This page collects several definitions and…
How should a machine learning model represent text? Word-level and character-level features are obvious options, but both have drawbacks…
These days we think a lot about using data to train large [ language model ]s. But there's only so much data in the world; eventually we'll…
In developing intuition about [ transformer ]s it's useful to think about specific primitive operations that can be implemented by a small…
The core of the transformer architecture is multi-headed [ attention ]. The transformer block consists of a multi-headed attention layer…
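A minimal single-head scaled dot-product attention in numpy, the primitive the multi-headed layer repeats with different learned projections; the learned Q/K/V projections, masking, and batching are omitted here.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (n_queries, d_v)

# Tiny smoke test: with identity queries/keys, each query attends mostly to its own key.
Q = K = np.eye(3)
V = np.arange(9.0).reshape(3, 3)
print(attention(Q, K, V))
```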
How should people do VI? One ultimate goal is that you write a Stan model (or better, a model with discrete variables, but one step at a…
Holy shit. In December on Galiano I was brainstorming about [ continuous structure learning ] and thought of the general trick, for…
Note: these are personal notes, taken as I was refreshing myself on this material. They're mostly stream of consciousness and probably not…