Doing [ math ] seems like a really promising area for AI. And by 'math' I mean math research (not arithmetic, which computers are already…

See https://emtiyaz.github.io/papers/learning_from_bayes.pdf Suppose we have a learning problem For some choice of exponential-family…

The Bayesian approach to statistics is to 'just use probability theory'. You write down a joint probability distribution over observed and…

A model of [ option ] prices that assumes: The existence of a risk-free asset paying some interest rate, for example, US Treasury bonds…

Aka computational trinitarianism. Churchill, My Early Life: I have noticed in my life deep resemblances between many different kinds of…

Related to [ natural gradient ] and the [ Fisher information ] matrix. Let's say we have a parametric model of some data. The Cramér-Rao…

A pointwise maximum of [ convex ] functions Specifically, we require that is convex in for every . is itself convex in , and when…
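
A quick numerical illustration of the claim that a pointwise maximum of convex functions is itself convex. This is a sketch, not a proof: the three component functions are arbitrary choices for illustration, and `pointwise_max` and `is_convex_on_samples` are hypothetical helper names.

```python
import random

def pointwise_max(fs):
    """Return g(x) = max_i f_i(x) for a list of one-argument functions."""
    return lambda x: max(f(x) for f in fs)

def is_convex_on_samples(g, lo, hi, trials=1000, seed=0, tol=1e-9):
    """Spot-check g(t*x + (1-t)*y) <= t*g(x) + (1-t)*g(y) at random points."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, t = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        if g(t * x + (1 - t) * y) > t * g(x) + (1 - t) * g(y) + tol:
            return False
    return True

# Illustrative convex components (assumed, not from the note).
fs = [lambda x: x * x, lambda x: abs(x - 1.0), lambda x: 2.0 * x + 3.0]
g = pointwise_max(fs)
print(is_convex_on_samples(g, -5.0, 5.0))
```

A random spot check can only refute convexity, never establish it, so a passing result is just consistent with the claim.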

Any reasonable 'adapted' and 'integrable' [ stochastic process ] can be written as the sum of a [ martingale ] and a [ predictable process…

One-particle system Let be the [ Lagrangian ] for a system with time-varying position and velocity , with forces defined by a potential…

Given a [ diffusion process ] specified by the [ stochastic differential equation ] the [ Fokker-Planck ] equation aka Kolmogorov forward…

Weight-Space View Recall standard linear regression. We suppose and where , where can be augmented with an implicit 1 term to allow a…

An Itô process is a [ stochastic process ] satisfying a [ stochastic differential equation ] of the form where is Brownian motion. This…

This is the technical formulation that makes it meaningful to write [ stochastic differential equation ]s 'driven by' a Wiener process…

The partial derivatives of a multivariate function form its Jacobian matrix The convention here (matching Wikipedia, and I believe also…

For any [ convex ] function and probability distribution , Jensen's inequality states that The special case of a distribution over two…
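
Jensen's inequality in its standard form, f(E[X]) <= E[f(X)] for convex f, can be checked numerically on a small discrete distribution. The values and probabilities below are made up for illustration, and `jensen_gap` is a hypothetical helper name.

```python
import math

def expectation(values, probs, f=lambda x: x):
    """E[f(X)] for a discrete distribution given as values and probabilities."""
    return sum(p * f(v) for v, p in zip(values, probs))

def jensen_gap(f, values, probs):
    """E[f(X)] - f(E[X]); nonnegative whenever f is convex."""
    return expectation(values, probs, f) - f(expectation(values, probs))

# Illustrative three-point distribution (assumed).
values = [-1.0, 0.5, 3.0]
probs = [0.2, 0.5, 0.3]

gap_exp = jensen_gap(math.exp, values, probs)
gap_sq = jensen_gap(lambda x: x * x, values, probs)
print(gap_exp, gap_sq)
```

For f(x) = x^2 the gap is exactly the variance of X, which is one way to remember which direction the inequality goes.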

Given a [ constrained optimization ] problem over a [ convex ] function , we consider the [ Lagrangian ] function introducing variables…

The Kraft inequality in information theory states (roughly?) that, for any probability distribution , there is a prefix code C under which…
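
A small check of the binary Kraft inequality, which says that codeword lengths l_i admit a prefix code exactly when the sum of 2^(-l_i) is at most 1. The sketch assumes the standard Shannon code lengths ceil(-log2 p_i); the note's own statement is truncated here, so this may not match its exact construction. `shannon_lengths` and `kraft_sum` are hypothetical names.

```python
import math

def shannon_lengths(probs):
    """Codeword lengths ceil(-log2 p) for each outcome with p > 0."""
    return [math.ceil(-math.log2(p)) for p in probs if p > 0]

def kraft_sum(lengths):
    """Sum of 2^(-l); a binary prefix code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -l for l in lengths)

# Illustrative dyadic distribution (assumed): lengths equal -log2 p exactly.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = shannon_lengths(probs)
print(lengths, kraft_sum(lengths))
```

Because ceil(-log2 p) >= -log2 p, each term 2^(-l_i) is at most p_i, so the Kraft sum can never exceed the total probability of 1.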

We're given a [ constrained optimization ] problem Note that the standard formulation of Lagrange multipliers handles only equality…

References: Jess Riedel on the Legendre transform in physics Stack Overflow discussion Prof. V. Balakrishnan on Hamiltonian dynamics…

Used in analyzing the stability of an equilibrium of a dynamical system. A Lyapunov function is a scalar-valued function of the state space…

A [ stochastic process ] in which the past is independent of the future, conditioned on the current value. Striking point made by https…

Notes from Charles Margossian's talk on pharmacometrics models. Types of ODEs: Linear: can be solved by [ matrix exponential ]s; nonlinear…

See [ generative vs discriminative modeling ], [ actor-critic ]

I'm trying to build my understanding. These are fragments of intuitions. Bayesian inference starts with a prior P and a likelihood. Given…

Notes from working through Kevin Buzzard's Natural number game (imperial.ac.uk) using the Lean theorem prover. We know from the [ Curry…

The -Wasserstein distance between probability distributions is defined as where the infimum is over all joint distributions having…

Fundamentally an algorithm is any computational procedure: something that takes in data and spits out some function of that data. Computer…

This is my stab at explaining automatic differentiation, specifically backprop and applications to neural nets. A few dimensions to think…

There are two major 'chain rules' relevant to machine learning: the chain rule of probability theory and the chain rule from calculus…

Multivariate Completion of Squares A useful trick: if is a symmetric, nonsingular matrix, then This is easy to see just by expanding out…

Suppose we want to optimize an objective under some equality and/or inequality constraints, Some general classes of approach we can use are…

A contraction mapping on a metric space is a function such that for all and for some , called the [ Lipschitz ] constant of the map…
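
A minimal sketch of why contractions are useful: by the Banach fixed-point theorem, iterating a contraction on a complete metric space converges to its unique fixed point. The example uses cos on [0, 1], where |cos'(x)| = |sin x| <= sin(1) < 1; the choice of map and the name `fixed_point` are illustrative assumptions.

```python
import math

def fixed_point(f, x0, tol=1e-12, max_iter=10_000):
    """Iterate x <- f(x) until successive iterates differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")

# cos maps [0, 1] into itself and is a contraction there.
root = fixed_point(math.cos, 0.5)
print(root, math.cos(root))
```

The contraction constant bounds the convergence rate: each iteration shrinks the distance to the fixed point by at least that factor.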

See also: https://www2.sonycsl.co.jp/person/nielsen/Note-LegendreTransformation.pdf Jess Riedel on the Legendre transform in physics looks…

A convex function satisfies the property that a line between any two points on its graph is on or above the graph: for any . It is…

References: http://www0.cs.ucl.ac.uk/staff/C.Archambeau/SDE_web/figs_files/ca07_RgIto_text.pdf https://www.ma.imperial.ac.uk/~pavl/lec_diff…

TODO: flesh out theory, understand ADMM (e.g., https://www.cis.upenn.edu/~cis515/ws-book-IIb.pdf )

Measures uncertainty, disorder, or randomness. The (Shannon) entropy of a probability distribution is: The quantity inside the…
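
A minimal sketch of Shannon entropy for a discrete distribution, H(p) = -sum_i p_i log2 p_i, measured in bits. The function name `entropy_bits` is hypothetical; terms with p_i = 0 are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])    # maximal uncertainty over two outcomes
biased_coin = entropy_bits([0.9, 0.1])  # less uncertain than a fair coin
certain = entropy_bits([1.0, 0.0])      # no uncertainty at all
print(fair_coin, biased_coin, certain)
```

Among distributions over n outcomes the uniform one has the largest entropy, log2(n) bits.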

Exponential Families, Conjugacy, Convexity, and Variational Inference Any parameterized family of probability densities that can be written…

A filtration is defined by monotonically increasing subsets of a [ probability space ]; that is, subsets such that we have for all…

We say that is a fixed point of an update rule if . Update rules can often (though not necessarily) be seen as defining an…

"What I cannot create, I do not understand". Related to: [ computational complexity ]: provers vs verifiers. [ P != NP ] [ production vs…

Multiple senses: An 'ill-conditioned matrix' has a large ratio between its largest and smallest eigenvalue (more generally, see what is a…

Importance sampling allows us to compute expectations under a distribution using samples from a different distribution , by weighting the…
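
A minimal sketch of the weighting idea: to estimate E_p[f(X)] from samples x_i drawn from q, average f(x_i) * p(x_i) / q(x_i). The target p = N(0, 1), proposal q = N(0, 2^2), and test function f(x) = x^2 are illustrative assumptions, as are the helper names.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(f, p_pdf, q_pdf, q_sampler, n, rng):
    """Estimate E_p[f(X)] using n samples from q, weighted by p(x) / q(x)."""
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n

rng = random.Random(0)
estimate = importance_estimate(
    f=lambda x: x * x,                        # E_p[X^2] = 1 under N(0, 1)
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),  # target density p
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),  # proposal density q
    q_sampler=lambda r: r.gauss(0.0, 2.0),    # draws from q
    n=20_000,
    rng=rng,
)
print(estimate)
```

The proposal here has heavier tails than the target, which keeps the weights bounded; a proposal with lighter tails than p can give an estimator with very large or infinite variance.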

The Leibniz calculus notation using infinitesimal quantities like or is simultaneously Very sensible and intuitive, but also Constantly…

Multiple senses: in machine learning: positive definite (Mercer) kernels; in linear algebra: kernel (nullspace) of a linear map; in CS systems…

A martingale is any [ stochastic process ] that stays the same in expectation. Formally, is a martingale if This condition is related to…

The Woodbury-Morrison-Sherman matrix inversion lemma is sometimes useful just for algebraic simplifications. In cases where and are…

Reviewing this 3blue1brown video: https://www.youtube.com/watch?v=O85OWBJ2ayo The matrix exponential is written as E to the power of a…
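
A minimal sketch of the matrix exponential as a truncated Taylor series, exp(A) = I + A + A^2/2! + A^3/3! + ..., assuming that series definition is the one the note goes on to use. `mat_mul` and `mat_exp` are hypothetical helpers on plain nested lists; the example checks the known fact that exponentiating a 2x2 skew-symmetric matrix gives a rotation.

```python
import math

def mat_mul(a, b):
    """Product of two matrices stored as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]

def mat_exp(a, terms=30):
    """Truncated Taylor series for exp(a); adequate for small-norm matrices."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds a^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

theta = 1.0
rotation = mat_exp([[0.0, -theta], [theta, 0.0]])
print(rotation)  # close to [[cos 1, -sin 1], [sin 1, cos 1]]
```

The plain series is fine for illustration but loses accuracy for matrices with large norm; numerical libraries such as SciPy's `scipy.linalg.expm` use more robust algorithms.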

A function is measurable with respect to [ sigma-algebra ]s on its domain and on its range if the pre-image of any event is…

Considering a bilevel optimization problem (or saddle point problem) on the two-argument function , in general it holds that That is, the…

We say that a random vector is multivariate Gaussian with mean and covariance matrix if it can be written where is a vector of i.i.d…

A negligible function is a function such that, for any positive integer there exists an integer such that for all , i.e., that…

If is a [ martingale ] and is a [ stopping time ], then any of the following conditions implies that : The stopping time is bounded…

A [ stochastic process ] is predictable if its value at time is fully determined by information available at time . Any fully…

A probability space consists of: A set of outcomes aka possible worlds; these represent all the ways the world might be. This is the…

Introduced by Geoff Hinton (1999): Products of Experts. Each expert produces a probability distribution. These are combined by…

Proximal methods in optimization The proximal operator of a [ convex ] function is defined as the minimizer of plus a distance penalty…

Formally, a random variable is a (measurable) function defined on outcomes from a [ probability space ] . That is, in any possible…

The rate equation or master equation for a continuous-time Markov [ stochastic process ] describes how the probability density of the…

References: Ludwig Winkler's post on Reverse time stochastic differential equations. Suppose we have a [ stochastic differential equation…

The score function is the gradient of a log-density with respect to its parameters: It is the direction that we would move the parameters…
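
A minimal worked instance of the score as the gradient of a log-density with respect to a parameter. The example assumes a univariate Gaussian with unknown mean mu and known sigma, for which the score is (x - mu) / sigma^2, and compares that closed form against a central finite difference. All function names are hypothetical.

```python
import math

def log_density(x, mu, sigma=1.0):
    """log N(x; mu, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def score_mu(x, mu, sigma=1.0):
    """d/dmu of log N(x; mu, sigma^2), in closed form."""
    return (x - mu) / sigma ** 2

def finite_diff(f, theta, eps=1e-6):
    """Central finite-difference approximation to f'(theta)."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

x, mu, sigma = 1.5, 0.2, 2.0
analytic = score_mu(x, mu, sigma)
numeric = finite_diff(lambda m: log_density(x, m, sigma), mu)
print(analytic, numeric)
```

The score is positive when the observation lies above the current mean, so following it moves mu toward the data, which matches the reading of the score as a direction of parameter movement.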

A [ stochastic process ] is (strictly) stationary if all of its joint distributions are invariant under time displacement. It is wide…

SDEs are typically written in terms of the differential of a Wiener process (Brownian motion), e.g., Although Wiener processes are nowhere…

A stochastic process is a collection of [ random variable ]s defined on a common [ probability space ] . Equivalently, it is a joint…

A stopping time for a stochastic process is a time-valued That is, integer-valued for discrete-time processes and real-valued for…

The tensor product of two vector spaces (defined on the same scalar field, we'll assume ) is the vector space of formal sums of…

Everyone in machine learning talks about tensors, but no one really understands what they are. This page collects several definitions and…

Trace of a Linear Operator We define the trace as the sum of diagonal elements of a matrix: Lemma : If and are square, then . Proof…

According to this reddit post, one of the main takeaways of functional analysis is that the right way to interpret the 'transpose' of a…

Inspired by Kevin Buzzard's overview of the state of automatic theorem provers. Type theory is like set theory in that sets and types are…

Note: these are personal notes, taken as I was refreshing myself on this material. They're mostly stream of consciousness and probably not…