We say that a random vector $Y$ is multivariate Gaussian with mean $\mu$ and covariance matrix $BB^T$ if it can be written
$$Y = \mu + BZ,$$
where $Z$ is a vector of i.i.d. standard Gaussian random variables. This definition is justified by noting
\begin{align*}
E[Y_i] &= E[\mu_i + (BZ)_i]\\
&= \mu_i + E\left[\sum_{j} B_{ij}Z_j\right]\\
&= \mu_i + \sum_{j} B_{ij} E[Z_j]\\
&= \mu_i
\end{align*}
and
\begin{align*}
\text{cov}(Y_i, Y_j) &= E[(Y_i - \mu_i)(Y_j - \mu_j)]\\
&= E[(BZ)_i(BZ)_j]\\
&= E\left[\left(\sum_{k} B_{ik}Z_k\right)\left(\sum_{l} B_{jl}Z_l\right)\right]\\
&= E\left[\sum_{k}\sum_l B_{ik}B_{jl} Z_kZ_l\right]\\
&= \sum_{k}\sum_l B_{ik}B_{jl} E\left[Z_kZ_l\right]\\
&= \sum_{k} B_{ik}B_{jk}\\
&= (BB^T)_{ij},
\end{align*}
where we used $E[Z_kZ_l] = \delta_{kl}$. This shows that the mean and covariance are as desired.
Given any covariance matrix $\Sigma$, we can construct a multivariate normal distribution by finding $B$ such that $\Sigma = BB^T$. This is possible as long as $\Sigma$ is positive semidefinite (has all non-negative eigenvalues). One approach is to consider the spectral decomposition $\Sigma = U\Lambda U^T$, where the columns of $U$ are unit eigenvectors and $\Lambda$ is the corresponding diagonal matrix of eigenvalues, and then take
$$B = U \Lambda^{1/2},$$
where $\Lambda^{1/2}$ is real-valued because we stipulated non-negative eigenvalues.
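As a concrete sketch (assuming numpy is available; the function and variable names here are my own), we can build $B$ from the spectral decomposition of $\Sigma$ and draw samples as $\mu + BZ$:

```python
import numpy as np

def sample_mvn(mu, Sigma, n_samples=10000, rng=None):
    """Draw samples from N(mu, Sigma) via the spectral decomposition Sigma = U diag(lam) U^T."""
    rng = np.random.default_rng() if rng is None else rng
    lam, U = np.linalg.eigh(Sigma)                       # eigenvalues lam >= 0 for a valid covariance
    B = U @ np.diag(np.sqrt(np.clip(lam, 0.0, None)))    # B = U Lambda^{1/2}
    Z = rng.standard_normal((len(mu), n_samples))        # i.i.d. standard Gaussians
    return (mu[:, None] + B @ Z).T                       # each row is one sample of Y = mu + B Z

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
Y = sample_mvn(mu, Sigma)
print(Y.mean(axis=0))   # approximately mu
print(np.cov(Y.T))      # approximately Sigma
```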
Other Stuff

See https://netfiles.uiuc.edu/jimarden/www/Classes/STAT510/chapter8.pdf.
- Any covariance matrix must be positive semidefinite (psd), and any psd matrix is viable for a multinormal; thus a multinormal can have any covariance matrix.
- It is easy to compute the pdf by starting with the pdf of $Z$ and doing a transformation of variables (note $z = B^{-1}(y-\mu)$).
- The characteristic function is the Fourier transform of the density.
- Affine transformations of a multinormal remain multinormal by definition.
- Another way to look at things is to note that the contours of a multinormal are ellipsoids with principal axes given by the eigenvectors of $\Sigma$, so we can rotate it by $U^{-1}$ (and scale by $\Lambda^{-1/2}$ and shift by $-\mu$) to get a standard normal.

Operations on Multivariate Gaussians

As noted above, we can think of any multivariate Gaussian $Y$ as some transformation $Y = \mu+BX$ of a standard Gaussian $X$. Now let $Y$ be any multivariate Gaussian and consider an affine transformation $Z = PY + b$. Here we have $Z = (P\mu +b) + PBX$, so $Z$ is still multivariate Gaussian with mean $P\mu+b$ and covariance $PBB^TP^T = P\Sigma P^T$.
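As a quick sanity check of the affine transformation property (a numpy sketch; the specific matrices are just made-up examples), we can propagate the mean and covariance in closed form and compare against samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up Gaussian Y ~ N(mu, Sigma) and an affine map Z = P Y + b.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
P = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0]])
b = np.array([0.5, -0.5])

# Closed-form mean and covariance of Z.
mean_Z = P @ mu + b
cov_Z = P @ Sigma @ P.T

# Monte Carlo check.
Y = rng.multivariate_normal(mu, Sigma, size=200000)
Z = Y @ P.T + b
print(mean_Z, Z.mean(axis=0))   # should roughly agree
print(cov_Z)
print(np.cov(Z.T))              # should roughly agree
```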
Suppose we have a Gaussian density on $\mathbf{a}$ whose mean is a (not necessarily invertible) linear transform of another vector $\mathbf{b}$,
$$\mathcal{N}(\mathbf{a}; P\mathbf{b}, \Sigma).$$
If we instead treat $\mathbf{a}$ as fixed, this becomes a Gaussian density on $\mathbf{b}$,
$$\mathcal{N}\left(\mathbf{b}; (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a}, (P^T\Sigma^{-1}P)^{-1}\right).$$
Proof: write out the density and complete the square,
\begin{align*}
\mathcal{N}(\mathbf{a}; P\mathbf{b}, \Sigma) &\propto \exp\left(-\frac{1}{2} (P\mathbf{b}-\mathbf{a})^T\Sigma^{-1}(P\mathbf{b}-\mathbf{a}) \right)\\
&\propto \exp\left(-\frac{1}{2}\mathbf{b}^TP^T \Sigma^{-1} P\mathbf{b} + \mathbf{a}^T\Sigma^{-1}P\mathbf{b} \right)\\
&\propto \exp\left(-\frac{1}{2}\left(\mathbf{b} - (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a} \right)^T P^T \Sigma^{-1} P \left(\mathbf{b} - (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a} \right)\right)\\
&\propto \mathcal{N}\left(\mathbf{b}; (P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a}, (P^T\Sigma^{-1}P)^{-1}\right).
\end{align*}
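Note that the mean $(P^T\Sigma^{-1}P)^{-1}P^T\Sigma^{-1}\mathbf{a}$ is just the generalized (weighted) least-squares solution of $P\mathbf{b} \approx \mathbf{a}$. A minimal numpy sketch (the function name and example values are mine) computing the reversed density's parameters:

```python
import numpy as np

def reverse_linear_gaussian(a, P, Sigma):
    """Given N(a; P b, Sigma) viewed as a density in b, return its mean and covariance."""
    Sinv = np.linalg.inv(Sigma)
    cov_b = np.linalg.inv(P.T @ Sinv @ P)   # (P^T Sigma^{-1} P)^{-1}
    mean_b = cov_b @ P.T @ Sinv @ a         # generalized least-squares estimate of b
    return mean_b, cov_b

# Example: a 3-dimensional observation of a 2-dimensional vector b.
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Sigma = np.diag([0.5, 0.5, 1.0])
a = np.array([1.0, 2.0, 2.9])
print(reverse_linear_gaussian(a, P, Sigma))
```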
Products

The product of a Gaussian in $\mathbf{x}$ with a Gaussian in a linear projection of $\mathbf{x}$ is an unnormalized Gaussian,
$$\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\,\mathcal{N}(P\mathbf{x}; \mathbf{b}, B) = z_c\,\mathcal{N}(\mathbf{x}; \mathbf{c}, C),$$
with
$$C = \left(A^{-1} + P^T B^{-1}P\right)^{-1}, \qquad \mathbf{c} = C \left(A^{-1}\mathbf{a} + P^TB^{-1}\mathbf{b}\right),$$
and with the normalization constant given by a Gaussian density,
$$z_c = \mathcal{N}\left(\mathbf{b}; P\mathbf{a}, B + PAP^T \right).$$
Proof: write out the densities and complete the square.
\begin{align*}
\mathcal{N}(x; a, A)\,\mathcal{N}(Px; b, B) &\propto \exp\left(-\frac{1}{2} (x-a)^TA^{-1} (x-a)\right)\exp\left(-\frac{1}{2} (Px-b)^TB^{-1} (Px-b)\right)\\
&\propto \exp\left(-\frac{1}{2} \left( x^T A^{-1}x - 2a^TA^{-1}x + x^T P^T B^{-1}P x - 2b^TB^{-1}Px \right)\right)\\
&\propto \exp\left(-\frac{1}{2} x^T \left(A^{-1} + P^T B^{-1}P\right)x + \left(a^TA^{-1} + b^TB^{-1}P\right)x\right)\\
&\propto \exp\left(-\frac{1}{2} x^T C^{-1} x + (C^{-1} c)^T x\right)\\
&\propto \exp\left(-\frac{1}{2} (x -c)^T C^{-1}(x-c) \right)\\
&\propto \mathcal{N}(x; c, C).
\end{align*}
The full derivation giving the normalizing constant is more annoying, but here it is for reference (for notational simplicity, treat both densities as $k$-dimensional):
\begin{align*}
\mathcal{N}(x; a, A)\,\mathcal{N}(Px; b, B) &= \frac{1}{(2\pi)^{k/2}|A|^{1/2}} \exp\left(-\frac{1}{2} (x-a)^TA^{-1} (x-a)\right) \frac{1}{(2\pi)^{k/2}|B|^{1/2}} \exp\left(-\frac{1}{2} (Px-b)^TB^{-1} (Px-b)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}} \exp\left(-\frac{1}{2} \left( x^T A^{-1}x - 2a^TA^{-1}x + a^T A^{-1} a + x^T P^T B^{-1}P x - 2b^TB^{-1}Px + b^T B^{-1} b\right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}}\exp\left(-\frac{1}{2} \left(x^T C^{-1} x - 2 c^T C^{-1} x + a^T A^{-1} a + b^T B^{-1} b \right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}}\exp\left(-\frac{1}{2} (x -c)^T C^{-1}(x-c) \right) \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right)\\
&= \frac{1}{(2\pi)^{k}\sqrt{|A||B|}} \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right) (2\pi)^{k/2}|C|^{1/2}\, \mathcal{N}(x; c, C).
\end{align*}
The final $\mathcal{N}(x; c, C)$ is normalized, so it integrates to one over $x$. We are left with
\begin{align*}
z_C &= (2\pi)^{-k/2}\sqrt{\frac{|C|}{|A||B|}} \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right),
\end{align*}
where we have $\frac{|A||B|}{|C|} = |PAP^T + B|$ immediately by the matrix determinant lemma, so it remains only to treat the exponential component:
\begin{align*}
z_C^\text{exp} &= \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - c^T C^{-1} c\right)\right) \\
&= \exp\left(-\frac{1}{2} \left(a^T A^{-1} a + b^T B^{-1} b - (A^{-1} a + P^T B^{-1} b)^T C (A^{-1} a + P^T B^{-1} b) \right)\right) \\
&= \exp\left(-\frac{1}{2} \left(a^T (A^{-1} - A^{-1}CA^{-1}) a + b^T (B^{-1} - B^{-1} P C P^T B^{-1}) b - 2 b^T (B^{-1} P C A^{-1}) a \right)\right).
\end{align*}
Applying the matrix inversion lemma (and its one-sided analogue) we get
\begin{align*}
A^{-1} - A^{-1} C A^{-1} &= A^{-1} - A^{-1} (A - AP^T (B + PAP^T)^{-1} P A) A^{-1} \\
&= P^T (B +PAP^T)^{-1} P,\\
B^{-1} - B^{-1} P C P^T B^{-1} &= (B + PAP^T)^{-1},\\
B^{-1} P C A^{-1} &= (B + PAP^T)^{-1} P,
\end{align*}
so we can rewrite the exponential as
\begin{align*}
z_C^\text{exp} &= \exp\left(-\frac{1}{2} \left(a^T P^T (B +PAP^T)^{-1} P a + b^T (B + PAP^T)^{-1} b - 2 b^T (B + PAP^T)^{-1} P a \right)\right)\\
&= \exp\left(-\frac{1}{2} (Pa - b)^T (B +PAP^T)^{-1} (Pa - b) \right),
\end{align*}
which matches the Gaussian density claimed above, completing the proof.
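A compact numerical sketch of the product identity (the helper name is mine, and scipy is assumed only to evaluate Gaussian densities for the check):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gaussian_product(a, A, P, b, B):
    """Return c, C, z_c such that N(x; a, A) N(Px; b, B) = z_c N(x; c, C)."""
    Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
    C = np.linalg.inv(Ainv + P.T @ Binv @ P)
    c = C @ (Ainv @ a + P.T @ Binv @ b)
    z_c = mvn.pdf(b, mean=P @ a, cov=B + P @ A @ P.T)
    return c, C, z_c

# Spot-check the identity at an arbitrary point x0.
a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
P = np.array([[1.0, -1.0]])
b, B = np.array([0.5]), np.array([[0.3]])
c, C, z_c = gaussian_product(a, A, P, b, B)

x0 = np.array([0.7, -0.1])
lhs = mvn.pdf(x0, mean=a, cov=A) * mvn.pdf(P @ x0, mean=b, cov=B)
rhs = z_c * mvn.pdf(x0, mean=c, cov=C)
print(lhs, rhs)   # should agree
```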
Multiple Products

The previous procedure can be iterated:
$$\mathcal{N}(\mathbf{x}; \mathbf{a}, A) \prod_{i=1}^n \mathcal{N}(P_{(i)}\mathbf{x}; \mathbf{b}_{(i)}, B_{(i)}) = z_c\,\mathcal{N}(\mathbf{x}; \mathbf{c}, C),$$
$$C = \left(A^{-1} + \sum_{i=1}^n P_{(i)}^T B_{(i)}^{-1}P_{(i)}\right)^{-1}, \qquad \mathbf{c} = C \left(A^{-1}\mathbf{a} + \sum_{i=1}^n P_{(i)}^TB_{(i)}^{-1}\mathbf{b}_{(i)}\right).$$
This can be shown easily by induction. I don't think the normalizing constants have a clean closed form (obviously they can just be computed recursively, applying the single-product result at each step), but am not totally sure about this.
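A short sketch of this accumulation in code (numpy assumed; the function name is mine), working in precision/information form so each factor just adds a term:

```python
import numpy as np

def gaussian_multi_product(a, A, factors):
    """Combine N(x; a, A) with factors [(P_i, b_i, B_i), ...]; return c, C."""
    prec = np.linalg.inv(A)     # accumulated precision: A^{-1} + sum_i P_i^T B_i^{-1} P_i
    h = prec @ a                # accumulated information vector: A^{-1} a + sum_i P_i^T B_i^{-1} b_i
    for P, b, B in factors:
        Binv = np.linalg.inv(B)
        prec = prec + P.T @ Binv @ P
        h = h + P.T @ Binv @ b
    C = np.linalg.inv(prec)
    return C @ h, C
```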
Quotients

See this post for a slightly modified version of the above derivation covering the quotient (rather than the product) of Gaussian densities.
Marginalization

Say we have $\mathbf{x} \sim \mathcal{N}(\mathbf{x}; \mathbf{a}, A)$ and $\mathbf{y}|\mathbf{x} \sim \mathcal{N}(\mathbf{y}; P\mathbf{x}+\mathbf{b}, B)$, i.e., $\mathbf{y}$ is a linear Gaussian function of $\mathbf{x}$. Now we have
\begin{align*}
p(\mathbf{y}) &= \int_{-\infty}^\infty p(\mathbf{y}|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\\
&= \int_{-\infty}^\infty \mathcal{N}(\mathbf{y}; P\mathbf{x}+\mathbf{b}, B)\,\mathcal{N}(\mathbf{x}; \mathbf{a}, A)\, d\mathbf{x}\\
& \qquad \text{Now applying the product property above, we get}\\
&= z_c \int_{-\infty}^\infty \mathcal{N}(\mathbf{x}; \mathbf{c}, C)\, d\mathbf{x}\\
& \qquad \text{and the Gaussian density integrates to one, leaving}\\
&= z_c = \mathcal{N}(\mathbf{y}; P\mathbf{a} + \mathbf{b}, B + PAP^T).
\end{align*}
There's also a direct approach. Write $\mathbf{y} = P\mathbf{x}+\mathbf{b}+\varepsilon$, where $\varepsilon \sim \mathcal{N}(0, B)$. Now we know $P\mathbf{x}+\mathbf{b} \sim \mathcal{N}(P\mathbf{a} + \mathbf{b}, PAP^T)$ by the affine transformation property above, so we just have the sum of two independent Gaussians, whose means and covariances add. Thus we conclude $p(\mathbf{y}) = \mathcal{N}(\mathbf{y}; P\mathbf{a} + \mathbf{b}, B + PAP^T)$.
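A brief Monte Carlo check of the marginalization result (a numpy sketch; the matrices are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)
a, A = np.array([1.0, 0.0]), np.array([[1.0, 0.4], [0.4, 0.5]])
P, b, B = np.array([[2.0, -1.0]]), np.array([0.3]), np.array([[0.2]])

# Sample x ~ N(a, A), then y | x ~ N(Px + b, B), and compare the empirical
# marginal of y against N(Pa + b, B + P A P^T).
x = rng.multivariate_normal(a, A, size=200000)
y = x @ P.T + b + rng.multivariate_normal(np.zeros(1), B, size=200000)
print(y.mean(axis=0), P @ a + b)                # means should roughly agree
print(np.cov(y.T), (B + P @ A @ P.T).item())    # variances should roughly agree
```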
Conditioning

Suppose ${\bf a}$ and ${\bf b}$ are vectors that are jointly Gaussian distributed:
$$\left[\begin{array}{c}{\bf a}\\{\bf b}\end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{c}\mu_a\\\mu_b\end{array}\right], \left[\begin{array}{cc}A & C^T\\C & B\end{array}\right]\right).$$
Then the distribution of ${\bf a} | {\bf b}$ is given by
$$P({\bf a} | {\bf b}) = \mathcal{N}\left(\mu_a + C^T B^{-1} ({\bf b} - \mu_b),\; A - C^T B^{-1} C\right).$$
See http://gbhqed.wordpress.com/2010/02/21/conditional-and-marginal-distributions-of-a-multivariate-gaussian/ or http://cs229.stanford.edu/section/more_on_gaussians.pdf for a proof.
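A sketch of the conditioning formula in code (numpy assumed; using `np.linalg.solve` rather than an explicit inverse is just a standard numerical choice, not part of the formula):

```python
import numpy as np

def condition_gaussian(mu_a, mu_b, A, B, C, b_obs):
    """Parameters of a | b for the joint with block covariance [[A, C^T], [C, B]]."""
    BinvC = np.linalg.solve(B, C)              # B^{-1} C, so C^T B^{-1} = BinvC.T (B symmetric)
    mean = mu_a + BinvC.T @ (b_obs - mu_b)     # mu_a + C^T B^{-1} (b - mu_b)
    cov = A - BinvC.T @ C                      # A - C^T B^{-1} C
    return mean, cov
```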
Observations with Gaussian noise

Let $X \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $Y = X + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \Sigma_\varepsilon)$, so $Y \sim \mathcal{N}(\mu_x, \Sigma_x + \Sigma_\varepsilon)$ and $X, Y$ are jointly Gaussian. Then
$$p(Y=y|X=x) = \mathcal{N}(y; x, \Sigma_\varepsilon).$$
To reverse the conditioning, we have
\begin{align*}
p(X=x|Y=y) &\propto p(Y=y | X=x)\,p(X=x)\\
&= \mathcal{N}(y; x, \Sigma_\varepsilon)\,\mathcal{N}(x; \mu_x, \Sigma_x)\\
&= \mathcal{N}(x; y, \Sigma_\varepsilon)\,\mathcal{N}(x; \mu_x, \Sigma_x) \quad \text{by symmetry of the Gaussian density}\\
&\propto \mathcal{N}\left(x; \Sigma\left(\Sigma_\varepsilon^{-1} y + \Sigma_x^{-1} \mu_x\right), \Sigma\right)
\end{align*}
for $\Sigma = \left(\Sigma_\varepsilon^{-1} + \Sigma_x^{-1}\right)^{-1}$.
The last step follows from the multiplication rule above (with $P = I$). Note that since we are computing a normalized probability distribution, the final normalizing constant will just be 1.
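A final sketch (numpy assumed; this is just the product rule with $P = I$) computing the posterior over $X$ given a noisy observation $y$:

```python
import numpy as np

def gaussian_posterior(mu_x, Sigma_x, Sigma_eps, y):
    """Posterior N(x; Sigma (Sigma_eps^{-1} y + Sigma_x^{-1} mu_x), Sigma) for Y = X + eps."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Se_inv = np.linalg.inv(Sigma_eps)
    Sigma = np.linalg.inv(Se_inv + Sx_inv)
    mean = Sigma @ (Se_inv @ y + Sx_inv @ mu_x)
    return mean, Sigma

mu_x = np.array([0.0, 0.0])
Sigma_x = np.eye(2)
Sigma_eps = 0.5 * np.eye(2)
print(gaussian_posterior(mu_x, Sigma_x, Sigma_eps, y=np.array([1.0, -1.0])))
```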