chain rule: Nonlinear Function
Created: August 13, 2022
Modified: August 13, 2022

chain rule

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There are two major 'chain rules' relevant to machine learning: the chain rule of probability theory and the chain rule from calculus.

Probability chain rule: Any joint density factors as the product of conditional densities

$$p(x_1, \ldots, x_n) = p(x_1)\prod_{i=2}^n p(x_i \mid x_1, \ldots, x_{i-1}).$$
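
As a concrete check (a sketch of mine, not from the original note), here is a small discrete example: a 2×2×2 joint table over three binary variables, whose implied conditionals multiply back to the joint exactly as the formula says.

```python
import jax.numpy as jnp

# Toy joint distribution p(x1, x2, x3) over three binary variables,
# stored as a 2x2x2 table that sums to one.
p_joint = jnp.array([0.10, 0.05, 0.15, 0.20,
                     0.08, 0.12, 0.18, 0.12]).reshape(2, 2, 2)

# Marginals and conditionals implied by the table.
p_x1 = p_joint.sum(axis=(1, 2))                # p(x1), shape (2,)
p_x1x2 = p_joint.sum(axis=2)                   # p(x1, x2), shape (2, 2)
p_x2_given_x1 = p_x1x2 / p_x1[:, None]         # p(x2 | x1)
p_x3_given_x12 = p_joint / p_x1x2[:, :, None]  # p(x3 | x1, x2)

# Chain rule: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2).
reconstructed = (p_x1[:, None, None]
                 * p_x2_given_x1[:, :, None]
                 * p_x3_given_x12)

assert jnp.allclose(reconstructed, p_joint)
```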

Calculus chain rule: the derivative of a function composition $f(g(x))$ is given by the product of 'local' derivatives

$$\frac{d}{dx} f(g(x)) = f'(g(x))\cdot g'(x) = \frac{df(g(x))}{dg(x)}\,\frac{dg(x)}{dx}$$
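
A minimal numerical sanity check (again my own sketch): with $f = \sin$ and $g(x) = x^2$, the hand-computed product $f'(g(x))\cdot g'(x) = \cos(x^2)\cdot 2x$ agrees with what `jax.grad` returns for the composition.

```python
import jax
import jax.numpy as jnp

f = jnp.sin                   # outer function
g = lambda x: x ** 2          # inner function

def composed(x):
    return f(g(x))

x = 1.3
# Chain rule by hand: f'(g(x)) * g'(x) = cos(x^2) * 2x.
manual = jnp.cos(g(x)) * 2 * x
# Autodiff applies the same rule automatically.
auto = jax.grad(composed)(x)

assert jnp.allclose(manual, auto)
```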

or in the multivariate case, the (transposed) Jacobian matrix $J_{x\to f(y(x))} = J_{f\circ y}^T$ is the product of local (transposed) Jacobians,

$$J_{x\to f(y(x))} = J_{x \to y(x)}\, J_{y(x) \to f(y(x))}.$$
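
A sketch of the multivariate case, using a toy $g : \mathbb{R}^2 \to \mathbb{R}^3$ and $f : \mathbb{R}^3 \to \mathbb{R}^2$ of my own choosing (here $g$ plays the role of $y$ above): `jax.jacfwd` returns the untransposed Jacobians, so the composition satisfies $J_{f\circ g} = J_f(g(x))\, J_g(x)$, and transposing both sides recovers the ordering written above.

```python
import jax
import jax.numpy as jnp

def g(x):                     # g : R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 3])

def f(y):                     # f : R^3 -> R^2
    return jnp.array([y[0] + y[1], y[1] * y[2]])

x = jnp.array([0.7, -1.2])

J_g = jax.jacfwd(g)(x)                     # 3x2 Jacobian of g at x
J_f = jax.jacfwd(f)(g(x))                  # 2x3 Jacobian of f at g(x)
J_fg = jax.jacfwd(lambda x: f(g(x)))(x)    # 2x2 Jacobian of the composition

# Untransposed: J_{f o g} = J_f(g(x)) @ J_g(x); transposing both sides
# gives J_{f o g}^T = J_g^T @ J_f^T, the ordering used in the note above.
assert jnp.allclose(J_fg, J_f @ J_g)
assert jnp.allclose(J_fg.T, J_g.T @ J_f.T)
```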

This fact is the foundation of autodiff.
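
To make that connection concrete (a sketch of mine, not a description of any particular autodiff implementation): reverse mode never builds the full Jacobians; it pulls an output cotangent $v$ back through each local transposed Jacobian via vector-Jacobian products, i.e. it evaluates the product above from right to left. The toy $f$ and $g$ are the same as in the previous sketch.

```python
import jax
import jax.numpy as jnp

def g(x):                     # inner map, R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 3])

def f(y):                     # outer map, R^3 -> R^2
    return jnp.array([y[0] + y[1], y[1] * y[2]])

x = jnp.array([0.7, -1.2])
v = jnp.array([1.0, 0.5])     # cotangent vector at the output of f

# Reverse mode: pull v back through f, then through g, using only VJPs.
_, f_vjp = jax.vjp(f, g(x))
_, g_vjp = jax.vjp(g, x)
(v_mid,) = f_vjp(v)           # J_f^T v
(v_in,) = g_vjp(v_mid)        # J_g^T J_f^T v = (J_{f o g})^T v

# Same answer as forming the full Jacobian of the composition.
J_fg = jax.jacfwd(lambda x: f(g(x)))(x)
assert jnp.allclose(v_in, J_fg.T @ v)
```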