chain rule: Nonlinear Function
Created: August 13, 2022
Modified: August 13, 2022

chain rule

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There are two major 'chain rules' relevant to machine learning: the chain rule of probability theory and the chain rule from calculus.

Probability chain rule: Any joint density factors as the product of conditional densities

$$p(x_1, \ldots, x_n) = p(x_1)\prod_{i=2}^n p(x_i \mid x_1, \ldots, x_{i-1}).$$
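
As a concrete check (a sketch of mine, not from the original note), here is a small discrete example: a 2×2×2 joint table over three binary variables, whose implied conditionals multiply back to the joint exactly as the formula says.

```python
import jax.numpy as jnp

# Toy joint distribution p(x1, x2, x3) over three binary variables,
# stored as a 2x2x2 table that sums to one.
p_joint = jnp.array([0.10, 0.05, 0.15, 0.20,
                     0.08, 0.12, 0.18, 0.12]).reshape(2, 2, 2)

# Marginals and conditionals implied by the table.
p_x1 = p_joint.sum(axis=(1, 2))                # p(x1), shape (2,)
p_x1x2 = p_joint.sum(axis=2)                   # p(x1, x2), shape (2, 2)
p_x2_given_x1 = p_x1x2 / p_x1[:, None]         # p(x2 | x1)
p_x3_given_x12 = p_joint / p_x1x2[:, :, None]  # p(x3 | x1, x2)

# Chain rule: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2).
reconstructed = (p_x1[:, None, None]
                 * p_x2_given_x1[:, :, None]
                 * p_x3_given_x12)

assert jnp.allclose(reconstructed, p_joint)
```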

Calculus chain rule: the derivative of a function composition $f(g(x))$ is given by the product of 'local' derivatives

$$\frac{d}{dx} f(g(x)) = f'(g(x))\cdot g'(x) = \frac{df(g(x))}{dg(x)}\,\frac{dg(x)}{dx}$$
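
A minimal numerical sanity check (again my own sketch): with $f = \sin$ and $g(x) = x^2$, the hand-computed product $f'(g(x))\cdot g'(x) = \cos(x^2)\cdot 2x$ agrees with what `jax.grad` returns for the composition.

```python
import jax
import jax.numpy as jnp

f = jnp.sin                   # outer function
g = lambda x: x ** 2          # inner function

def composed(x):
    return f(g(x))

x = 1.3
# Chain rule by hand: f'(g(x)) * g'(x) = cos(x^2) * 2x.
manual = jnp.cos(g(x)) * 2 * x
# Autodiff applies the same rule automatically.
auto = jax.grad(composed)(x)

assert jnp.allclose(manual, auto)
```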

or in the multivariate case, the (transposed) Jacobian matrix $J_{x\to f(y(x))} = J_{f\circ y}^T$ is the product of local (transposed) Jacobians,

$$J_{x\to f(y(x))} = J_{x \to y(x)}\, J_{y(x) \to f(y(x))}.$$
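
A sketch of the multivariate case, using a toy $g : \mathbb{R}^2 \to \mathbb{R}^3$ and $f : \mathbb{R}^3 \to \mathbb{R}^2$ of my own choosing (here $g$ plays the role of $y$ above): `jax.jacfwd` returns the untransposed Jacobians, so the composition satisfies $J_{f\circ g} = J_f(g(x))\, J_g(x)$, and transposing both sides recovers the ordering written above.

```python
import jax
import jax.numpy as jnp

def g(x):                     # g : R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 3])

def f(y):                     # f : R^3 -> R^2
    return jnp.array([y[0] + y[1], y[1] * y[2]])

x = jnp.array([0.7, -1.2])

J_g = jax.jacfwd(g)(x)                     # 3x2 Jacobian of g at x
J_f = jax.jacfwd(f)(g(x))                  # 2x3 Jacobian of f at g(x)
J_fg = jax.jacfwd(lambda x: f(g(x)))(x)    # 2x2 Jacobian of the composition

# Untransposed: J_{f o g} = J_f(g(x)) @ J_g(x); transposing both sides
# gives J_{f o g}^T = J_g^T @ J_f^T, the ordering used in the note above.
assert jnp.allclose(J_fg, J_f @ J_g)
assert jnp.allclose(J_fg.T, J_g.T @ J_f.T)
```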

This fact is the foundation of autodiff.
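
To make that connection concrete (a sketch of mine, not a description of any particular autodiff implementation): reverse mode never builds the full Jacobians; it pulls an output cotangent $v$ back through each local transposed Jacobian via vector-Jacobian products, i.e. it evaluates the product above from right to left. The toy $f$ and $g$ are the same as in the previous sketch.

```python
import jax
import jax.numpy as jnp

def g(x):                     # inner map, R^2 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 3])

def f(y):                     # outer map, R^3 -> R^2
    return jnp.array([y[0] + y[1], y[1] * y[2]])

x = jnp.array([0.7, -1.2])
v = jnp.array([1.0, 0.5])     # cotangent vector at the output of f

# Reverse mode: pull v back through f, then through g, using only VJPs.
_, f_vjp = jax.vjp(f, g(x))
_, g_vjp = jax.vjp(g, x)
(v_mid,) = f_vjp(v)           # J_f^T v
(v_in,) = g_vjp(v_mid)        # J_g^T J_f^T v = (J_{f o g})^T v

# Same answer as forming the full Jacobian of the composition.
J_fg = jax.jacfwd(lambda x: f(g(x)))(x)
assert jnp.allclose(v_in, J_fg.T @ v)
```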