Multivariate Gaussian
Created: February 09, 2014
Modified: March 16, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We say that a random vector $Y$ is multivariate Gaussian with mean $\mu$ and covariance matrix $BB^T$ if it can be written

$$Y = \mu + BZ$$

where $Z$ is a vector of i.i.d. standard Gaussian random variables. This definition is justified by noting

$$\begin{align*}
E[Y_i] &= E[\mu_i + (BZ)_i]\\
&= \mu_i + E\left[\sum_{j} B_{ij}Z_j\right]\\
&= \mu_i + \sum_{j} B_{ij} E[Z_j]\\
&= \mu_i
\end{align*}$$

and

$$\begin{align*}
\text{cov}(Y_i, Y_j) &= E[(Y_i - \mu_i)(Y_j - \mu_j)]\\
&= E[(BZ)_i(BZ)_j]\\
&= E\left[\left(\sum_{k} B_{ik}Z_k\right)\left(\sum_{l} B_{jl}Z_l\right)\right]\\
&= E\left[\sum_{k}\sum_l B_{ik}B_{jl} Z_kZ_l\right]\\
&= \sum_{k}\sum_l B_{ik}B_{jl} E\left[Z_kZ_l\right]\\
&= \sum_{k} B_{ik}B_{jk} \qquad \text{since } E[Z_kZ_l] = \delta_{kl}\\
&= (BB^T)_{ij}
\end{align*}$$

which shows that the mean and covariance are as desired.

Given any covariance matrix $\Sigma$, we can construct a multivariate normal distribution by finding $B$ such that $\Sigma = BB^T$. This is possible as long as $\Sigma$ is positive semidefinite (has all non-negative eigenvalues). One approach is to consider the spectral decomposition $\Sigma = U\Lambda U^T$, where the columns of $U$ are unit eigenvectors and $\Lambda$ is the corresponding diagonal matrix of eigenvalues, and then take

$$B = U \Lambda^{1/2}$$

where $\Lambda^{1/2}$ is real-valued because we stipulated non-negative eigenvalues.
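As a sanity check, here is a minimal numpy sketch (function name is mine, not from any library) that constructs $B = U\Lambda^{1/2}$ via `np.linalg.eigh` and samples by drawing $Z$ and applying $\mu + BZ$:

```python
import numpy as np

def sample_mvn(mu, Sigma, n_samples, rng=None):
    """Sample from N(mu, Sigma) via the spectral decomposition Sigma = U diag(lam) U^T."""
    rng = np.random.default_rng() if rng is None else rng
    lam, U = np.linalg.eigh(Sigma)           # Sigma must be symmetric PSD
    lam = np.clip(lam, 0.0, None)            # guard against tiny negative eigenvalues from roundoff
    B = U * np.sqrt(lam)                     # columnwise scaling: U @ diag(sqrt(lam))
    Z = rng.standard_normal((n_samples, len(mu)))  # i.i.d. standard normals
    return mu + Z @ B.T                      # each row is mu + B z

# quick check: empirical covariance should approach Sigma
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
Y = sample_mvn(np.array([1.0, -2.0]), Sigma, 200_000)
print(np.cov(Y, rowvar=False))               # ~ [[2.0, 0.8], [0.8, 1.0]]
```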

Other Stuff

See https://netfiles.uiuc.edu/jimarden/www/Classes/STAT510/chapter8.pdf.

  • any covariance matrix must be psd, and any psd matrix is viable for a multinormal; thus a multinormal can have any covariance matrix
  • easy to compute the pdf starting with the pdf of $Z$ and doing a transformation of variables (note $z = B^{-1}(y-\mu)$)
  • the characteristic function is the Fourier transform of the density.
  • affine transformations remain multinormal by definition
  • another way to look at things is to note that the contours of a multinormal are ellipsoids with principal axes given by the eigenvectors of $\Sigma$, so we can shift by $-\mu$, rotate by $U^{-1}$, and scale by $\Lambda^{-1/2}$ to get a standard normal.

Operations on Multivariate Gaussians

Affine Transformations

As noted above, we can think of any multivariate Gaussian $Y$ as some transformation $Y = \mu + BX$ of a standard Gaussian $X$. Now let $Y$ be any multivariate Gaussian and consider an affine transformation $Z = PY + b$. Here we have $Z = (P\mu + b) + PBX$, so $Z$ is still multivariate Gaussian with mean $P\mu + b$ and covariance $PBB^TP^T = P\Sigma P^T$.
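This is easy to verify numerically; a quick sketch (the specific matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 0.5]])
P = rng.standard_normal((2, 3))   # affine map into a lower-dimensional space
b = np.array([0.1, -0.4])

Y = rng.multivariate_normal(mu, Sigma, size=300_000)
Z = Y @ P.T + b                   # apply z = P y + b to each sample

print(Z.mean(axis=0), P @ mu + b)                   # means agree
print(np.cov(Z, rowvar=False), P @ Sigma @ P.T)     # covariances agree
```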

Implicit Linear Transformations

Suppose we have a Gaussian density on $\mathbf{a}$ whose mean is a (not necessarily invertible) linear transform of another vector $\mathbf{b}$,

$$\mathcal{N}(\mathbf{a}; P\mathbf{b}, \Sigma).$$

If we instead treat $\mathbf{a}$ as fixed, this is proportional to a Gaussian density on $\mathbf{b}$,

$$\mathcal{N}\left(\mathbf{b};\ (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a},\ (P^T\Sigma^{-1}P)^{-1}\right).$$

Proof: write out the density and complete the square,

$$\begin{align*}
\mathcal{N}(\mathbf{a}; P\mathbf{b}, \Sigma) &\propto \exp\left(-\frac{1}{2} (P\mathbf{b}-\mathbf{a})^T\Sigma^{-1}(P\mathbf{b}-\mathbf{a}) \right)\\
&\propto \exp\left(-\frac{1}{2}\mathbf{b}^TP^T \Sigma^{-1} P\mathbf{b} + \mathbf{a}^T\Sigma^{-1}P\mathbf{b} \right)\\
&\propto \exp\left(-\frac{1}{2}\left(\mathbf{b} - (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a} \right)^T P^T \Sigma^{-1} P \left(\mathbf{b} - (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a} \right)\right)\\
&\propto \mathcal{N}\left(\mathbf{b};\ (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a},\ (P^T\Sigma^{-1}P)^{-1}\right)
\end{align*}$$
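Computationally, the resulting mean is just the generalized least squares estimate of $\mathbf{b}$. A sketch, assuming $P$ has full column rank so that $P^T\Sigma^{-1}P$ is invertible (function name is mine):

```python
import numpy as np

def implicit_transform(a, P, Sigma):
    """Parameters of the Gaussian in b obtained by reading N(a; P b, Sigma) as a function of b."""
    Sinv = np.linalg.inv(Sigma)
    prec = P.T @ Sinv @ P          # precision of the resulting density; needs P full column rank
    cov = np.linalg.inv(prec)
    mean = cov @ P.T @ Sinv @ a    # the generalized-least-squares estimate of b
    return mean, cov
```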

Products

The product of a Gaussian in $\mathbf{x}$ with a Gaussian in a linear projection of $\mathbf{x}$ is an unnormalized Gaussian,

$$\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\,\mathcal{N}(P\mathbf{x}; \mathbf{b}, B) = z_c\,\mathcal{N}(\mathbf{x}; \mathbf{c}, C),$$

with

$$C = \left(A^{-1} + P^T B^{-1}P\right)^{-1}, \qquad \mathbf{c} = C \left(A^{-1}\mathbf{a} + P^TB^{-1}\mathbf{b}\right)$$

and with the normalization constant given by a Gaussian density

$$z_c = \mathcal{N}\left(\mathbf{b};\ P\mathbf{a},\ B + PAP^T \right).$$

Proof: write out the densities and complete the square.

$$\begin{align*}
\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\,\mathcal{N}(\mathbf{b}; P\mathbf{x}, B) &\propto \exp\left(-\frac{1}{2} (\mathbf{x}-\mathbf{a})^TA^{-1} (\mathbf{x}-\mathbf{a})\right)\exp\left(-\frac{1}{2} (P\mathbf{x}-\mathbf{b})^TB^{-1} (P\mathbf{x}-\mathbf{b})\right)\\
&\propto \exp\left(-\frac{1}{2} \left( \mathbf{x}^T A^{-1}\mathbf{x} - 2\mathbf{a}^TA^{-1}\mathbf{x} + \mathbf{x}^T P^T B^{-1}P \mathbf{x} - 2\mathbf{b}^TB^{-1}P\mathbf{x} \right)\right)\\
&\propto \exp\left(-\frac{1}{2} \mathbf{x}^T \left(A^{-1} + P^T B^{-1}P\right)\mathbf{x} + \left(\mathbf{a}^TA^{-1} + \mathbf{b}^TB^{-1}P\right)\mathbf{x}\right)\\
&\propto \exp\left(-\frac{1}{2} \mathbf{x}^T C^{-1} \mathbf{x} + \mathbf{c}^T C^{-1} \mathbf{x}\right)\\
&\propto \exp\left(-\frac{1}{2} (\mathbf{x} -\mathbf{c})^T C^{-1}(\mathbf{x}-\mathbf{c}) \right)\\
&\propto \mathcal{N}(\mathbf{x}; \mathbf{c}, C).
\end{align*}$$

The full derivation giving the normalizing constant is more annoying. But here it is just for reference:

$$\begin{align*}
\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\,\mathcal{N}(\mathbf{b}; P\mathbf{x}, B) &= \frac{1}{(2\pi)^{k/2}|A|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\mathbf{a})^TA^{-1} (\mathbf{x}-\mathbf{a})\right) \frac{1}{(2\pi)^{k/2}|B|^{1/2}} \exp\left(-\frac{1}{2} (P\mathbf{x}-\mathbf{b})^TB^{-1} (P\mathbf{x}-\mathbf{b})\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}} \exp\left(-\frac{1}{2} \left( \mathbf{x}^T A^{-1}\mathbf{x} - 2\mathbf{a}^TA^{-1}\mathbf{x} + \mathbf{a}^T A^{-1} \mathbf{a} + \mathbf{x}^T P^T B^{-1}P \mathbf{x} - 2\mathbf{b}^TB^{-1}P\mathbf{x} + \mathbf{b}^T B^{-1} \mathbf{b}\right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}}\exp\left(-\frac{1}{2} \left(\mathbf{x}^T C^{-1} \mathbf{x} - 2\mathbf{c}^T C^{-1} \mathbf{x} + \mathbf{a}^T A^{-1} \mathbf{a} + \mathbf{b}^T B^{-1} \mathbf{b} \right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}}\exp\left(-\frac{1}{2} (\mathbf{x} -\mathbf{c})^T C^{-1}(\mathbf{x}-\mathbf{c}) \right) \exp\left(-\frac{1}{2} \left(\mathbf{a}^T A^{-1} \mathbf{a} + \mathbf{b}^T B^{-1} \mathbf{b} - \mathbf{c}^T C^{-1} \mathbf{c}\right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}} \exp\left(-\frac{1}{2} \left(\mathbf{a}^T A^{-1} \mathbf{a} + \mathbf{b}^T B^{-1} \mathbf{b} - \mathbf{c}^T C^{-1} \mathbf{c}\right)\right) (2\pi)^{k/2}|C|^{1/2}\, \mathcal{N}(\mathbf{x}; \mathbf{c}, C)
\end{align*}$$

The final $\mathcal{N}(\mathbf{x}; \mathbf{c}, C)$ is normalized, so it integrates to one over $\mathbf{x}$. We are left with

$$z_c = (2\pi)^{-k/2}\sqrt{\frac{|C|}{|A||B|}}\, \exp\left(-\frac{1}{2} \left(\mathbf{a}^T A^{-1} \mathbf{a} + \mathbf{b}^T B^{-1} \mathbf{b} - \mathbf{c}^T C^{-1} \mathbf{c}\right)\right),$$

where $\frac{|C|}{|A||B|} = |PAP^T + B|^{-1}$ follows immediately from the matrix determinant lemma, so it remains only to treat the exponential component:

$$\begin{align*}
z_c^\text{exp} &= \exp\left(-\frac{1}{2} \left(\mathbf{a}^T A^{-1} \mathbf{a} + \mathbf{b}^T B^{-1} \mathbf{b} - \mathbf{c}^T C^{-1} \mathbf{c}\right)\right) \\
&= \exp\left(-\frac{1}{2} \left(\mathbf{a}^T A^{-1} \mathbf{a} + \mathbf{b}^T B^{-1} \mathbf{b} - (A^{-1} \mathbf{a} + P^T B^{-1} \mathbf{b})^T C (A^{-1} \mathbf{a} + P^T B^{-1} \mathbf{b}) \right)\right) \\
&= \exp\left(-\frac{1}{2} \left(\mathbf{a}^T (A^{-1} - A^{-1}CA^{-1}) \mathbf{a} + \mathbf{b}^T (B^{-1} - B^{-1} P C P^T B^{-1}) \mathbf{b} - 2 \mathbf{b}^T (B^{-1} P C A^{-1}) \mathbf{a} \right)\right).
\end{align*}$$

Applying the matrix inversion lemma (and its one-sided analogue) we get

$$\begin{align*}
A^{-1} - A^{-1} C A^{-1} &= A^{-1} - A^{-1} (A - AP^T (B + PAP^T)^{-1} P A) A^{-1} \\
&= P^T (B + PAP^T)^{-1} P,\\
B^{-1} - B^{-1} P C P^T B^{-1} &= (B + PAP^T)^{-1},\\
B^{-1} P C A^{-1} &= (B + PAP^T)^{-1} P,
\end{align*}$$

so we can rewrite the exponential as

$$\begin{align*}
z_c^\text{exp} &= \exp\left(-\frac{1}{2} \left(\mathbf{a}^T P^T (B + PAP^T)^{-1} P \mathbf{a} + \mathbf{b}^T (B + PAP^T)^{-1} \mathbf{b} - 2 \mathbf{b}^T (B + PAP^T)^{-1} P \mathbf{a} \right)\right)\\
&= \exp\left(-\frac{1}{2} (P\mathbf{a} - \mathbf{b})^T (B + PAP^T)^{-1} (P\mathbf{a} - \mathbf{b}) \right),
\end{align*}$$

which matches the claimed normalizing constant and completes the proof!
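The whole product rule, including $z_c$, is easy to verify pointwise. A sketch using scipy's `multivariate_normal` (the test matrices are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gaussian_product(a, A, b, B, P):
    """Combine N(x; a, A) * N(Px; b, B) into z_c * N(x; c, C)."""
    Ainv = np.linalg.inv(A)
    Binv = np.linalg.inv(B)
    C = np.linalg.inv(Ainv + P.T @ Binv @ P)
    c = C @ (Ainv @ a + P.T @ Binv @ b)
    z_c = mvn(mean=P @ a, cov=B + P @ A @ P.T).pdf(b)   # the normalizer is itself a Gaussian density
    return c, C, z_c

# pointwise check at an arbitrary x
rng = np.random.default_rng(1)
a, A = np.zeros(3), np.diag([1.0, 2.0, 0.5])
b, B = np.array([0.3, -0.7]), np.array([[1.0, 0.2], [0.2, 0.8]])
P = rng.standard_normal((2, 3))
c, C, z_c = gaussian_product(a, A, b, B, P)
x = rng.standard_normal(3)
lhs = mvn(mean=a, cov=A).pdf(x) * mvn(mean=P @ x, cov=B).pdf(b)
rhs = z_c * mvn(mean=c, cov=C).pdf(x)
print(lhs, rhs)    # should agree to floating-point precision
```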

Multiple Products

The previous procedure can be iterated:

$$\mathcal{N}(\mathbf{x}; \mathbf{a}, A) \prod_{i=1}^n \mathcal{N}(P_{(i)}\mathbf{x}; \mathbf{b}_{(i)}, B_{(i)}) = z_c\,\mathcal{N}(\mathbf{x}; \mathbf{c}, C),$$

$$C = \left(A^{-1} + \sum_{i=1}^n P_{(i)}^T B_{(i)}^{-1}P_{(i)}\right)^{-1}, \qquad \mathbf{c} = C \left(A^{-1}\mathbf{a} + \sum_{i=1}^n P_{(i)}^TB_{(i)}^{-1}\mathbf{b}_{(i)}\right)$$

This can be shown easily by induction. I don't think the normalizing constants have a clean iterative form (obviously they can just be computed recursively), but I am not totally sure about this.
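In precision (natural) parameters the iteration is just a running sum, which is how I'd implement it; a sketch with my own naming:

```python
import numpy as np

def gaussian_multi_product(a, A, terms):
    """Fold N(x; a, A) * prod_i N(P_i x; b_i, B_i) into N(x; c, C).

    `terms` is a list of (P_i, b_i, B_i) triples."""
    prec = np.linalg.inv(A)          # accumulate the precision matrix
    h = prec @ a                     # and the precision-weighted mean
    for P_i, b_i, B_i in terms:
        Binv = np.linalg.inv(B_i)
        prec = prec + P_i.T @ Binv @ P_i
        h = h + P_i.T @ Binv @ b_i
    C = np.linalg.inv(prec)
    return C @ h, C                  # (c, C)
```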

Quotients

See this post for a slightly modified version of the above derivation covering the quotient (rather than the product) of Gaussian densities.

Marginalization

Say we have $\mathbf{x} \sim \mathcal{N}(\mathbf{x}; \mathbf{a}, A)$ and $\mathbf{y}|\mathbf{x} \sim \mathcal{N}(\mathbf{y}; P\mathbf{x}+\mathbf{b}, B)$, i.e. $\mathbf{y}$ is a linear Gaussian function of $\mathbf{x}$. Now we have

$$\begin{align*}
p(\mathbf{y}) &= \int_{-\infty}^\infty p(\mathbf{y}|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\\
&= \int_{-\infty}^\infty \mathcal{N}(\mathbf{y}; P\mathbf{x}+\mathbf{b}, B)\,\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\, d\mathbf{x}\\
& \qquad \text{Now applying the product property above, we get}\\
&= z_c \int_{-\infty}^\infty \mathcal{N}(\mathbf{x}; \mathbf{c}, C)\, d\mathbf{x}\\
& \qquad \text{and the Gaussian density integrates to one, leaving}\\
&= z_c = \mathcal{N}(\mathbf{y}; P\mathbf{a} + \mathbf{b}, B + PAP^T).
\end{align*}$$

There's also a direct approach. Write $\mathbf{y} = P\mathbf{x}+\mathbf{b}+\varepsilon$, where $\varepsilon \sim \mathcal{N}(0, B)$. Now we know $P\mathbf{x}+\mathbf{b} \sim \mathcal{N}(P\mathbf{a} + \mathbf{b}, PAP^T)$ by the affine transformation property above, so $\mathbf{y}$ is just the sum of two independent Gaussians, whose means and covariances add. Thus we conclude $p(\mathbf{y}) = \mathcal{N}(\mathbf{y}; P\mathbf{a} + \mathbf{b}, B + PAP^T)$.
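A Monte Carlo sanity check of the marginal, sampling from the generative model (a sketch; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
a, A = np.array([1.0, 0.0]), np.array([[1.0, 0.4], [0.4, 2.0]])
P = np.array([[1.0, -1.0]])
b, B = np.array([0.5]), np.array([[0.25]])

# sample y = P x + b + eps, with x ~ N(a, A) and eps ~ N(0, B)
n = 400_000
X = rng.multivariate_normal(a, A, size=n)
Y = X @ P.T + b + rng.multivariate_normal(np.zeros(1), B, size=n)

print(Y.mean(axis=0), P @ a + b)                 # means agree
print(Y[:, 0].var(), (B + P @ A @ P.T)[0, 0])    # variances agree
```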

Conditioning

Suppose $\mathbf{a}$ and $\mathbf{b}$ are vectors jointly Gaussian distributed:

$$\left[\begin{array}{c}\mathbf{a}\\\mathbf{b}\end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{c}\mu_a\\\mu_b\end{array}\right], \left[\begin{array}{cc}A & C^T\\C & B\end{array}\right]\right)$$

Then the distribution of $\mathbf{a} | \mathbf{b}$ is given by

$$p(\mathbf{a} | \mathbf{b}) = \mathcal{N}\left(\mu_a + C^T B^{-1} (\mathbf{b} - \mu_b),\ A - C^T B^{-1} C\right).$$

See http://gbhqed.wordpress.com/2010/02/21/conditional-and-marginal-distributions-of-a-multivariate-gaussian/ or http://cs229.stanford.edu/section/more_on_gaussians.pdf for proof.
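A direct transcription of the conditioning formula (a sketch; using `np.linalg.solve` rather than forming $B^{-1}$ explicitly, and function name is mine):

```python
import numpy as np

def condition(mu_a, mu_b, A, B, C, b_obs):
    """Distribution of a | b = b_obs under the joint Gaussian above, with C = cov(b, a)."""
    K = np.linalg.solve(B, C).T        # C^T B^{-1}, using symmetry of B
    mean = mu_a + K @ (b_obs - mu_b)   # mu_a + C^T B^{-1} (b - mu_b)
    cov = A - K @ C                    # A - C^T B^{-1} C
    return mean, cov
```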

Observations with Gaussian noise

Let $X \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $Y = X + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \Sigma_\varepsilon)$, so $Y \sim \mathcal{N}(\mu_x, \Sigma_x + \Sigma_\varepsilon)$ and $X, Y$ are jointly Gaussian. Then

$$p(Y=y|X=x) = \mathcal{N}(y; x, \Sigma_\varepsilon).$$

To reverse the conditioning, we have

$$\begin{align*}
p(X=x|Y=y) &\propto p(Y=y | X=x)\,p(X=x)\\
&= \mathcal{N}(y; x, \Sigma_\varepsilon)\,\mathcal{N}(x; \mu_x, \Sigma_x)\\
&= \mathcal{N}(x; y, \Sigma_\varepsilon)\,\mathcal{N}(x; \mu_x, \Sigma_x) \qquad \text{by symmetry of the Gaussian density}\\
&\propto \mathcal{N}\left(x;\ \Sigma\left(\Sigma_\varepsilon^{-1} y + \Sigma_x^{-1} \mu_x\right),\ \Sigma\right)
\end{align*}$$

for $\Sigma = \left(\Sigma_\varepsilon^{-1} + \Sigma_x^{-1}\right)^{-1}$. The last step follows from the product rule above (with $P = I$). Note that since we are computing a normalized probability distribution, the final normalizing constant must be exactly one.
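Putting the posterior formula into code (a sketch, with my own function name):

```python
import numpy as np

def gaussian_posterior(y, mu_x, Sigma_x, Sigma_eps):
    """Posterior p(x | y) for y = x + eps, eps ~ N(0, Sigma_eps)."""
    prec = np.linalg.inv(Sigma_eps) + np.linalg.inv(Sigma_x)       # posterior precision
    Sigma = np.linalg.inv(prec)                                    # posterior covariance
    mean = Sigma @ (np.linalg.solve(Sigma_eps, y) + np.linalg.solve(Sigma_x, mu_x))
    return mean, Sigma
```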