Jacobian: Nonlinear Function
Created: August 13, 2022
Modified: August 13, 2022

Jacobian

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The partial derivatives of a multivariate function $f(x): \mathbb{R}^n \to \mathbb{R}^m$ form its Jacobian matrix

$$J_f = \left(\begin{array}{ccc} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & & \vdots\\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{array}\right).$$

The convention here (matching Wikipedia, and I believe also used by Jax and other autodiff implementations) is that the Jacobian is the $m \times n$ matrix whose rows correspond to output coordinates and whose columns correspond to input coordinates. This matches the notation for function composition, in which $f(g(x))$ denotes a right-to-left flow from the inputs $x$ to the final output from $f$.
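As a quick sanity check of this shape convention, here is a minimal JAX sketch (the particular function $f$ below is just a made-up example, not anything from the text):

```python
import jax
import jax.numpy as jnp

# f: R^3 -> R^2, so its Jacobian should be a 2x3 matrix
# (rows indexed by outputs, columns by inputs).
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
J = jax.jacobian(f)(x)
print(J.shape)  # (2, 3)
```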

Transpose notation

I don't love the usual convention because I generally find it more intuitive to think of input-to-output flow as a left-to-right mapping. So I sometimes prefer to work explicitly in terms of the transpose $J^T$, using the notation

$$J_{x \to f(x)} = J_f^T$$

to represent an $n \times m$ matrix $J_{x \to f(x)}$.

In this notation, the chain rule for the derivative of a composed function $f(g(x))$ follows the left-to-right flow of the computation graph,

$$J_{x \to (f \circ g)(x)} = J_{x \to g(x)} \, J_{g(x) \to f(g(x))}$$

which I find more intuitive than the equivalent $J_{f \circ g}(x) = J_f(g(x)) \, J_g(x)$ in the usual notation.
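To make the left-to-right multiplication concrete, here is a small JAX sketch verifying the identity numerically (the functions $g$ and $f$ are arbitrary examples chosen only for illustration):

```python
import jax
import jax.numpy as jnp

def g(x):  # R^2 -> R^3
    return jnp.array([x[0] ** 2, x[0] * x[1], jnp.exp(x[1])])

def f(y):  # R^3 -> R^2
    return jnp.array([y[0] + y[1], y[1] * y[2]])

x = jnp.array([0.5, -1.0])

# Transposed Jacobians: J_{x -> g(x)} is 2x3, J_{g(x) -> f(g(x))} is 3x2.
Jt_g = jax.jacobian(g)(x).T
Jt_f = jax.jacobian(f)(g(x)).T

# The left-to-right product matches the transposed Jacobian of the composition.
Jt_fg = jax.jacobian(lambda x: f(g(x)))(x).T
print(jnp.allclose(Jt_g @ Jt_f, Jt_fg))  # True
```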

I also like that this notation specializes nicely to the gradient of a scalar function $f(x): \mathbb{R}^n \to \mathbb{R}$,

$$J_{x \to f(x)} = \nabla_x f(x)$$

under the usual convention that the gradient is a column vector.

(Debatably, gradients should be thought of as row vectors to indicate that they are really dual vectors, aka linear functionals on the underlying vector space. Under this view, the conventional definition of a Jacobian is more natural. However, machine learning usually ignores this, implicitly using the isomorphism between a vector space and its dual induced by whatever basis we're using to do the computations.)
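As a quick check of this gradient specialization, a minimal JAX sketch (again with a made-up scalar function; note that JAX represents both the gradient and the Jacobian of a scalar-valued function as the same 1-D array, so there is no row/column distinction to worry about in code):

```python
import jax
import jax.numpy as jnp

# A scalar-valued example: f: R^3 -> R.
def f(x):
    return jnp.sum(x ** 2)

x = jnp.array([1.0, 2.0, 3.0])

# jax.grad returns an array with the same shape as x; for a scalar
# output, jax.jacobian returns the same 1-D array, so the two agree.
print(jnp.allclose(jax.grad(f)(x), jax.jacobian(f)(x)))  # True
```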