We say that a random vector $Y$ is multivariate Gaussian with mean $\mu$ and covariance matrix $BB^T$ if it can be written
$$Y = \mu + BZ,$$
where $Z$ is a vector of i.i.d. standard Gaussian random variables. This definition is justified by noting
\begin{align*}
E[Y_i] &= E[\mu_i + (BZ)_i]\\
&= \mu_i + E\left[\sum_{j} B_{ij}Z_j\right]\\
&= \mu_i + \sum_{j} B_{ij} E[Z_j]\\
&= \mu_i
\end{align*}
and
\begin{align*}
\text{cov}(Y_i, Y_j) &= E[(Y_i - \mu_i)(Y_j - \mu_j)]\\
&= E[(BZ)_i(BZ)_j]\\
&= E\left[\left(\sum_{k} B_{ik}Z_k\right)\left(\sum_{l} B_{jl}Z_l\right)\right]\\
&= E\left[\sum_{k}\sum_l B_{ik}B_{jl} Z_kZ_l\right]\\
&= \sum_{k}\sum_l B_{ik}B_{jl} E\left[Z_kZ_l\right]\\
&= \sum_{k} B_{ik}B_{jk}\\
&= (BB^T)_{ij},
\end{align*}
where we used $E[Z_kZ_l] = \delta_{kl}$. This shows that the mean and covariance are as desired.
Given any covariance matrix $\Sigma$, we can construct a multivariate normal distribution by finding $B$ such that $\Sigma = BB^T$. This is possible as long as $\Sigma$ is positive semidefinite (has all non-negative eigenvalues). One approach is to consider the spectral decomposition $\Sigma = U\Lambda U^T$, where the columns of $U$ are unit eigenvectors and $\Lambda$ is the corresponding diagonal matrix of eigenvalues, and then take
$$B = U \Lambda^{1/2},$$
where $\Lambda^{1/2}$ is real-valued because we stipulated non-negative eigenvalues.
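As a concrete sketch (assuming numpy is available; the function and variable names here are my own), we can build $B$ from the spectral decomposition of $\Sigma$ and draw samples as $\mu + BZ$:

```python
import numpy as np

def sample_mvn(mu, Sigma, n_samples=10000, rng=None):
    """Draw samples from N(mu, Sigma) via the spectral decomposition Sigma = U diag(lam) U^T."""
    rng = np.random.default_rng() if rng is None else rng
    lam, U = np.linalg.eigh(Sigma)                       # eigenvalues lam >= 0 for a valid covariance
    B = U @ np.diag(np.sqrt(np.clip(lam, 0.0, None)))    # B = U Lambda^{1/2}
    Z = rng.standard_normal((len(mu), n_samples))        # i.i.d. standard Gaussians
    return (mu[:, None] + B @ Z).T                       # each row is one sample of Y = mu + B Z

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
Y = sample_mvn(mu, Sigma)
print(Y.mean(axis=0))   # approximately mu
print(np.cov(Y.T))      # approximately Sigma
```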
Other Stuff

See https://netfiles.uiuc.edu/jimarden/www/Classes/STAT510/chapter8.pdf.
- Any covariance matrix must be positive semidefinite (psd), and any psd matrix is viable for a multinormal; thus a multinormal can have any covariance matrix.
- It is easy to compute the pdf by starting with the pdf of $Z$ and doing a transformation of variables (note $z = B^{-1}(y-\mu)$).
- The characteristic function is the Fourier transform of the density.
- Affine transformations of a multinormal remain multinormal by definition.
- Another way to look at things is to note that the contours of a multinormal are ellipsoids with principal axes given by the eigenvectors of $\Sigma$, so we can rotate it by $U^{-1}$ (and scale by $\Lambda^{-1/2}$ and shift by $-\mu$) to get a standard normal.

Operations on Multivariate Gaussians

As noted above, we can think of any multivariate Gaussian $Y$ as some transformation $Y = \mu+BX$ of a standard Gaussian $X$. Now let $Y$ be any multivariate Gaussian and consider an affine transformation $Z = PY + b$. Here we have $Z = (P\mu +b) + PBX$, so $Z$ is still multivariate Gaussian with mean $P\mu+b$ and covariance $PBB^TP^T = P\Sigma P^T$.
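As a quick sanity check of the affine transformation property (a numpy sketch; the specific matrices are just made-up examples), we can propagate the mean and covariance in closed form and compare against samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up Gaussian Y ~ N(mu, Sigma) and an affine map Z = P Y + b.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
P = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0]])
b = np.array([0.5, -0.5])

# Closed-form mean and covariance of Z.
mean_Z = P @ mu + b
cov_Z = P @ Sigma @ P.T

# Monte Carlo check.
Y = rng.multivariate_normal(mu, Sigma, size=200000)
Z = Y @ P.T + b
print(mean_Z, Z.mean(axis=0))   # should roughly agree
print(cov_Z)
print(np.cov(Z.T))              # should roughly agree
```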
Suppose we have a Gaussian density on $\mathbf{a}$ whose mean is a (not necessarily invertible) linear transform of another vector $\mathbf{b}$,
$$\mathcal{N}(\mathbf{a}; P\mathbf{b}, \Sigma).$$
If we instead treat $\mathbf{a}$ as fixed, this becomes a Gaussian density on $\mathbf{b}$,
$$\mathcal{N}\left(\mathbf{b}; (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a}, (P^T\Sigma^{-1}P)^{-1}\right).$$
Proof: write out the density and complete the square,
\begin{align*}
\mathcal{N}(\mathbf{a}; P\mathbf{b}, \Sigma) &\propto \exp\left(-\frac{1}{2} (P\mathbf{b}-\mathbf{a})^T\Sigma^{-1}(P\mathbf{b}-\mathbf{a}) \right)\\
&\propto \exp\left(-\frac{1}{2}\mathbf{b}^TP^T \Sigma^{-1} P\mathbf{b} + \mathbf{a}^T\Sigma^{-1}P\mathbf{b} \right)\\
&\propto \exp\left(-\frac{1}{2}\left(\mathbf{b} - (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a} \right)^T P^T \Sigma^{-1} P \left(\mathbf{b} - (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a} \right)\right)\\
&\propto \mathcal{N}\left(\mathbf{b}; (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a}, (P^T\Sigma^{-1}P)^{-1}\right).
\end{align*}
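Note that the mean $(P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a}$ is just the generalized (weighted) least-squares solution of $P\mathbf{b} \approx \mathbf{a}$. A minimal numpy sketch (the function name and example values are mine) computing the reversed density's parameters:

```python
import numpy as np

def reverse_linear_gaussian(a, P, Sigma):
    """Given N(a; P b, Sigma) viewed as a density in b, return its mean and covariance."""
    Sinv = np.linalg.inv(Sigma)
    cov_b = np.linalg.inv(P.T @ Sinv @ P)   # (P^T Sigma^{-1} P)^{-1}
    mean_b = cov_b @ P.T @ Sinv @ a         # generalized least-squares estimate of b
    return mean_b, cov_b

# Example: a 3-dimensional observation of a 2-dimensional vector b.
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Sigma = np.diag([0.5, 0.5, 1.0])
a = np.array([1.0, 2.0, 2.9])
print(reverse_linear_gaussian(a, P, Sigma))
```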
Products

The product of a Gaussian in $\mathbf{x}$ with a Gaussian in a linear projection of $\mathbf{x}$ is an unnormalized Gaussian,
$$\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\,\mathcal{N}(P\mathbf{x}; \mathbf{b}, B) = z_c\,\mathcal{N}(\mathbf{x}; \mathbf{c}, C),$$
with
$$C = \left(A^{-1} + P^T B^{-1}P\right)^{-1}, \qquad \mathbf{c} = C \left(A^{-1}\mathbf{a} + P^TB^{-1}\mathbf{b}\right),$$
and with the normalization constant given by a Gaussian density,
$$z_c = \mathcal{N}\left(\mathbf{b}; P\mathbf{a}, B + PAP^T \right).$$
Proof: write out the densities and complete the square.
\begin{align*}
\mathcal{N}(x; a, A)\,\mathcal{N}(Px; b, B) &\propto \exp\left(-\frac{1}{2} (x-a)^TA^{-1} (x-a)\right)\exp\left(-\frac{1}{2} (Px-b)^TB^{-1} (Px-b)\right)\\
&\propto \exp\left(-\frac{1}{2} \left( x^T A^{-1}x - 2a^TA^{-1}x + x^T P^T B^{-1}P x - 2b^TB^{-1}Px \right)\right)\\
&\propto \exp\left(-\frac{1}{2} x^T \left(A^{-1} + P^T B^{-1}P\right)x + \left(a^TA^{-1} + b^TB^{-1}P\right)x\right)\\
&\propto \exp\left(-\frac{1}{2} x^T C^{-1} x + (C^{-1} c)^T x\right)\\
&\propto \exp\left(-\frac{1}{2} (x -c)^T C^{-1}(x-c) \right)\\
&\propto \mathcal{N}(x; c, C).
\end{align*}
The full derivation giving the normalizing constant is more annoying, but here it is for reference (for notational simplicity, treat both densities as $k$-dimensional):
\begin{align*}
\mathcal{N}(x; a, A)\,\mathcal{N}(Px; b, B) &= \frac{1}{(2\pi)^{k/2}|A|^{1/2}} \exp\left(-\frac{1}{2} (x-a)^TA^{-1} (x-a)\right) \frac{1}{(2\pi)^{k/2}|B|^{1/2}} \exp\left(-\frac{1}{2} (Px-b)^TB^{-1} (Px-b)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}} \exp\left(-\frac{1}{2} \left( x^T A^{-1}x - 2a^TA^{-1}x + a^T A^{-1} a + x^T P^T B^{-1}P x - 2b^TB^{-1}Px + b^T B^{-1} b\right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}}\exp\left(-\frac{1}{2} \left(x^T C^{-1} x - 2 c^T C^{-1} x + a^T A^{-1} a + b^T B^{-1} b \right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}}\exp\left(-\frac{1}{2} (x -c)^T C^{-1}(x-c) \right) \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}} \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right) (2\pi)^{k/2}|C|^{1/2}\, \mathcal{N}(x; c, C).
\end{align*}
The final $\mathcal{N}(x; c, C)$ is normalized, so it integrates to one over $x$. We are left with
\begin{align*}
z_C &= (2\pi)^{-k/2}\sqrt{\frac{|C|}{|A||B|}} \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right),
\end{align*}
where we have $\frac{|A||B|}{|C|} = |PAP^T + B|$ immediately by the matrix determinant lemma, so it remains only to treat the exponential component:
\begin{align*}
z_C^\text{exp} &= \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right) \\
&= \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - (A^{-1} a + P^T B^{-1} b)^T C (A^{-1} a + P^T B^{-1} b) \right)\right) \\
&= \exp\left(-\frac{1}{2} \left(a^T (A^{-1} - A^{-1}CA^{-1}) a + b^T (B^{-1} - B^{-1} P C P^T B^{-1}) b - 2 b^T (B^{-1} P C A^{-1}) a \right)\right).
\end{align*}
Applying the matrix inversion lemma (and its one-sided analogue) we get
\begin{align*}
A^{-1} - A^{-1} C A^{-1} &= A^{-1} - A^{-1} (A - AP^T (B + PAP^T)^{-1} P A) A^{-1} \\
&= P^T (B +PAP^T)^{-1} P,\\
B^{-1} - B^{-1} P C P^T B^{-1} &= (B + PAP^T)^{-1},\\
B^{-1} P C A^{-1} &= (B + PAP^T)^{-1} P,
\end{align*}
so we can rewrite the exponential as
\begin{align*}
z_C^\text{exp} &= \exp\left(-\frac{1}{2} \left(a^T P^T (B +PAP^T)^{-1} P a + b^T (B + PAP^T)^{-1} b - 2 b^T (B + PAP^T)^{-1} P a \right)\right)\\
&= \exp\left(-\frac{1}{2} (Pa - b)^T (B +PAP^T)^{-1} (Pa - b) \right),
\end{align*}
which matches the Gaussian density claimed above, completing the proof.
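A compact numerical sketch of the product identity (the helper name is mine, and scipy is assumed only to evaluate Gaussian densities for the check):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gaussian_product(a, A, P, b, B):
    """Return c, C, z_c such that N(x; a, A) N(Px; b, B) = z_c N(x; c, C)."""
    Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
    C = np.linalg.inv(Ainv + P.T @ Binv @ P)
    c = C @ (Ainv @ a + P.T @ Binv @ b)
    z_c = mvn.pdf(b, mean=P @ a, cov=B + P @ A @ P.T)
    return c, C, z_c

# Spot-check the identity at an arbitrary point x0.
a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
P = np.array([[1.0, -1.0]])
b, B = np.array([0.5]), np.array([[0.3]])
c, C, z_c = gaussian_product(a, A, P, b, B)

x0 = np.array([0.7, -0.1])
lhs = mvn.pdf(x0, mean=a, cov=A) * mvn.pdf(P @ x0, mean=b, cov=B)
rhs = z_c * mvn.pdf(x0, mean=c, cov=C)
print(lhs, rhs)   # should agree
```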
Multiple Products

The previous procedure can be iterated:
$$\mathcal{N}(\mathbf{x}; \mathbf{a}, A) \prod_{i=1}^n \mathcal{N}(P_{(i)}\mathbf{x}; \mathbf{b}_{(i)}, B_{(i)}) = z_c\,\mathcal{N}(\mathbf{x}; \mathbf{c}, C),$$
$$C = \left(A^{-1} + \sum_{i=1}^n P_{(i)}^T B_{(i)}^{-1}P_{(i)}\right)^{-1}, \qquad \mathbf{c} = C \left(A^{-1}\mathbf{a} + \sum_{i=1}^n P_{(i)}^TB_{(i)}^{-1}\mathbf{b}_{(i)}\right).$$
This can be shown easily by induction. I don't think the normalizing constants have a clean closed form (obviously they can just be computed recursively, applying the single-product result at each step), but am not totally sure about this.
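A short sketch of this accumulation in code (numpy assumed; the function name is mine), working in precision/information form so each factor just adds a term:

```python
import numpy as np

def gaussian_multi_product(a, A, factors):
    """Combine N(x; a, A) with factors [(P_i, b_i, B_i), ...]; return c, C."""
    prec = np.linalg.inv(A)     # accumulated precision: A^{-1} + sum_i P_i^T B_i^{-1} P_i
    h = prec @ a                # accumulated information vector: A^{-1} a + sum_i P_i^T B_i^{-1} b_i
    for P, b, B in factors:
        Binv = np.linalg.inv(B)
        prec = prec + P.T @ Binv @ P
        h = h + P.T @ Binv @ b
    C = np.linalg.inv(prec)
    return C @ h, C
```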
Quotients

See this post for a slightly modified version of the above derivation covering the quotient (rather than the product) of Gaussian densities.
Marginalization

Say we have $\mathbf{x} \sim \mathcal{N}(\mathbf{x}; \mathbf{a}, A)$ and $\mathbf{y}|\mathbf{x} \sim \mathcal{N}(\mathbf{y}; P\mathbf{x}+\mathbf{b}, B)$, i.e., $\mathbf{y}$ is a linear Gaussian function of $\mathbf{x}$. Now we have
\begin{align*}
p(\mathbf{y}) &= \int_{-\infty}^\infty p(\mathbf{y}|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\\
&= \int_{-\infty}^\infty \mathcal{N}(\mathbf{y}; P\mathbf{x}+\mathbf{b}, B)\,\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\, d\mathbf{x}\\
& \qquad \text{Now applying the product property above, we get}\\
&= z_c \int_{-\infty}^\infty \mathcal{N}(\mathbf{x}; \mathbf{c}, C)\, d\mathbf{x}\\
& \qquad \text{and the Gaussian density integrates to one, leaving}\\
&= z_c = \mathcal{N}(\mathbf{y}; P\mathbf{a} + \mathbf{b}, B + PAP^T).
\end{align*}
There's also a direct approach. Write $\mathbf{y} = P\mathbf{x}+\mathbf{b}+\varepsilon$, where $\varepsilon \sim \mathcal{N}(0, B)$. Now we know $P\mathbf{x}+\mathbf{b} \sim \mathcal{N}(P\mathbf{a} + \mathbf{b}, PAP^T)$ by the affine transformation property above, so we just have the sum of two independent Gaussians, whose means and covariances add. Thus we conclude $p(\mathbf{y}) = \mathcal{N}(\mathbf{y}; P\mathbf{a} + \mathbf{b}, B + PAP^T)$.
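A brief Monte Carlo check of the marginalization result (a numpy sketch; the matrices are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)
a, A = np.array([1.0, 0.0]), np.array([[1.0, 0.4], [0.4, 0.5]])
P, b, B = np.array([[2.0, -1.0]]), np.array([0.3]), np.array([[0.2]])

# Sample x ~ N(a, A), then y | x ~ N(Px + b, B), and compare the empirical
# marginal of y against N(Pa + b, B + P A P^T).
x = rng.multivariate_normal(a, A, size=200000)
y = x @ P.T + b + rng.multivariate_normal(np.zeros(1), B, size=200000)
print(y.mean(axis=0), P @ a + b)                # means should roughly agree
print(np.cov(y.T), (B + P @ A @ P.T).item())    # variances should roughly agree
```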
Conditioning

Suppose ${\bf a}$ and ${\bf b}$ are vectors that are jointly Gaussian distributed:
$$\left[\begin{array}{c}{\bf a}\\{\bf b}\end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{c}\mu_a\\\mu_b\end{array}\right], \left[\begin{array}{cc}A & C^T\\C & B\end{array}\right]\right).$$
Then the distribution of ${\bf a} | {\bf b}$ is given by
$$P({\bf a} | {\bf b}) = \mathcal{N}\left(\mu_a + C^T B^{-1} ({\bf b} - \mu_b),\; A - C^T B^{-1} C\right).$$
See http://gbhqed.wordpress.com/2010/02/21/conditional-and-marginal-distributions-of-a-multivariate-gaussian/ or http://cs229.stanford.edu/section/more_on_gaussians.pdf for a proof.
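A sketch of the conditioning formula in code (numpy assumed; using `np.linalg.solve` rather than an explicit inverse is just a standard numerical choice, not part of the formula):

```python
import numpy as np

def condition_gaussian(mu_a, mu_b, A, B, C, b_obs):
    """Parameters of a | b for the joint with block covariance [[A, C^T], [C, B]]."""
    BinvC = np.linalg.solve(B, C)              # B^{-1} C, so C^T B^{-1} = BinvC.T (B symmetric)
    mean = mu_a + BinvC.T @ (b_obs - mu_b)     # mu_a + C^T B^{-1} (b - mu_b)
    cov = A - BinvC.T @ C                      # A - C^T B^{-1} C
    return mean, cov
```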
Observations with Gaussian noise

Let $X \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $Y = X + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \Sigma_\varepsilon)$, so $Y \sim \mathcal{N}(\mu_x, \Sigma_x + \Sigma_\varepsilon)$ and $X, Y$ are jointly Gaussian. Then
$$p(Y=y|X=x) = \mathcal{N}(y; x, \Sigma_\varepsilon).$$
To reverse the conditioning, we have
\begin{align*}
p(X=x|Y=y) &\propto p(Y=y | X=x)\,p(X=x)\\
&= \mathcal{N}(y; x, \Sigma_\varepsilon)\,\mathcal{N}(x; \mu_x, \Sigma_x)\\
&= \mathcal{N}(x; y, \Sigma_\varepsilon)\,\mathcal{N}(x; \mu_x, \Sigma_x) \quad \text{by symmetry of the Gaussian density}\\
&\propto \mathcal{N}\left(x; \Sigma\left(\Sigma_\varepsilon^{-1} y + \Sigma_x^{-1} \mu_x\right), \Sigma\right)
\end{align*}
for $\Sigma = \left(\Sigma_\varepsilon^{-1} + \Sigma_x^{-1}\right)^{-1}$.
The last step follows from the multiplication rule above (with $P = I$). Note that since we are computing a normalized probability distribution, the final normalizing constant will just be 1.
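A final sketch (numpy assumed; this is just the product rule with $P = I$) computing the posterior over $X$ given a noisy observation $y$:

```python
import numpy as np

def gaussian_posterior(mu_x, Sigma_x, Sigma_eps, y):
    """Posterior N(x; Sigma (Sigma_eps^{-1} y + Sigma_x^{-1} mu_x), Sigma) for Y = X + eps."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Se_inv = np.linalg.inv(Sigma_eps)
    Sigma = np.linalg.inv(Se_inv + Sx_inv)
    mean = Sigma @ (Se_inv @ y + Sx_inv @ mu_x)
    return mean, Sigma

mu_x = np.array([0.0, 0.0])
Sigma_x = np.eye(2)
Sigma_eps = 0.5 * np.eye(2)
print(gaussian_posterior(mu_x, Sigma_x, Sigma_eps, y=np.array([1.0, -1.0])))
```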