trace: Nonlinear Function
Created: November 12, 2013
Modified: March 16, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Trace of a Linear Operator

We define the trace of a square matrix as the sum of its diagonal elements:

$$\text{tr}(A) = \sum_{i} a_{ii}.$$
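In NumPy this is just np.trace, which we can sanity-check against the definition:

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
# The trace is the sum of the diagonal entries a_ii: here 0 + 4 + 8.
assert np.trace(A) == A[0, 0] + A[1, 1] + A[2, 2] == 12.0
```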

Lemma: If $A$ and $B$ are square, then $\text{tr}(AB) = \text{tr}(BA)$. Proof: Using the notation above, we write

$$\text{tr}(AB) = \sum_i \sum_j a_{ij} b_{ji} = \sum_j \sum_i b_{ji} a_{ij} = \text{tr}(BA).$$
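A quick NumPy sanity check of the lemma:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# tr(AB) and tr(BA) agree even though AB != BA in general.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```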

Applying the lemma to the grouped product $(AB)$ and $C$ gives $\text{tr}(ABC) = \text{tr}((AB)C) = \text{tr}(C(AB)) = \text{tr}(CAB)$, and the same argument shows that in general trace is invariant to cyclic permutation, i.e., it holds that $\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)$. Note that trace is not invariant to arbitrary permutations (or even just swapping of adjacent pairs). For example, $\text{tr}(ABC) \ne \text{tr}(BAC)$ in general.
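And a similar check for the cyclic property, along with its failure under a non-cyclic swap:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))

# Cyclic permutations preserve the trace...
t = np.trace(A @ B @ C)
assert np.isclose(t, np.trace(C @ A @ B))
assert np.isclose(t, np.trace(B @ C @ A))

# ...but swapping adjacent factors (a non-cyclic permutation) generally does not.
print(t, np.trace(B @ A @ C))  # almost surely different
```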

Now we can prove an important result.

Theorem: Trace is preserved under change of basis. Proof: Let $B$ be an invertible matrix. Then $\text{tr}(B^{-1}AB) = \text{tr}(BB^{-1}A) = \text{tr}(A)$ by the cyclic permutation property above. This shows that for any change-of-basis matrix $B$, the trace of $A$ is preserved in the new basis.
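Again easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))  # a random matrix is almost surely invertible

# The trace of A is unchanged after a change of basis by B.
assert np.isclose(np.trace(np.linalg.inv(B) @ A @ B), np.trace(A))
```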

From this we get an easy but important corollary.

Theorem: The trace of a linear operator is the sum of its eigenvalues. Proof: Let $A$ be a diagonalizable linear operator, i.e., one with a full set of $n$ linearly independent eigenvectors. Then we can write $A$ in its eigenbasis as

$$A = V D V^{-1},$$

where $D$ is the diagonal matrix of eigenvalues $D_{ii} = \lambda_i$ and the columns of $V$ are the eigenvectors $(v_1, \ldots, v_n)$, so that $D = V^{-1} A V$. Now by the theorem above, we have

$$\text{tr}(A) = \text{tr}(D) = \sum_i \lambda_i.$$

Note this theorem is actually true even for non-diagonalizable operators (i.e., where we can't write an explicit eigendecomposition because there aren't $n$ linearly independent eigenvectors), provided the eigenvalues are counted with algebraic multiplicity, but you need a different proof, e.g., via the Schur decomposition or the coefficients of the characteristic polynomial.
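Both cases are easy to check numerically, e.g., with a random matrix (diagonalizable with probability 1) and a Jordan block (not diagonalizable):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
# Eigenvalues of a real matrix may come in complex conjugate pairs,
# but their sum still equals the (real) trace.
assert np.isclose(np.linalg.eigvals(A).sum(), np.trace(A))

# A Jordan block has only one eigenvector, yet the identity still holds.
J = np.array([[2.0, 1.0],
              [0.0, 2.0]])
assert np.isclose(np.linalg.eigvals(J).sum(), np.trace(J))
```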

Randomized Approximation

Suppose we want to compute $\text{tr}(A^{-1})$, but we don't want to invert $A$, since a dense inversion takes $O(n^3)$ time. There's a randomized trick which gets around this. Let $x \sim \mathcal{N}(0, I)$, so that $E[xx^T] = I$. Then

$$\text{tr}(A^{-1}) = \text{tr}(A^{-1} I) = \text{tr}(A^{-1} E[x x^T]) = E\left[\text{tr}(A^{-1} x x^T)\right] = E\left[\text{tr}(x^T A^{-1} x)\right] = E[x^T A^{-1} x],$$

where the second-to-last step uses cyclic invariance and the last uses the fact that the trace of a $1 \times 1$ matrix is just its single entry.

So we can approximate the trace by sampling a bunch of vectors $x$ and averaging the resulting values of $x^T A^{-1} x$. Each value can be computed without inverting $A$: solve the linear system $Ay = x$, then take $x^T y$. With an iterative solver such as conjugate gradient, each solve needs only matrix-vector products with $A$, which cost $O(n^2)$ each (or less if $A$ is sparse or structured). As long as the samples collectively cost less than the $O(n^3)$ of a full inversion, we save time.
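Here's a minimal sketch of this Monte Carlo scheme (usually called a Hutchinson-type trace estimator) in NumPy. For simplicity it uses a dense np.linalg.solve and compares against the exact answer, which defeats the computational point; in practice you'd swap in an iterative solver and skip the comparison. The symmetric positive definite test matrix is just an arbitrary choice to keep the solves well behaved.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# An arbitrary symmetric positive definite test matrix.
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)

# Average x^T A^{-1} x over Gaussian probe vectors x.
num_samples = 1000
total = 0.0
for _ in range(num_samples):
    x = rng.standard_normal(n)
    y = np.linalg.solve(A, x)  # stand-in for an iterative solve of Ay = x
    total += x @ y
estimate = total / num_samples

print("estimate:", estimate)
print("exact:   ", np.trace(np.linalg.inv(A)))  # for comparison only
```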

There's presumably some theory about how many samples we actually need to get a good approximation; I think http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/Papers/Sardenia_Talk.pdf has more info.