Gaussian process: Nonlinear Function
Created: February 09, 2014
Modified: March 16, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Weight-Space View

Recall standard linear regression. We suppose $f(x) = x^T w$ and $y = f(x) + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$, and where $x$ can be augmented with an implicit 1 term to allow a bias to be learned. This gives us a likelihood of observing any particular set of data points, given weights. If we're Bayesian we'll also need to put a prior on the weights; in particular let's take $w \sim N(0, \Sigma)$ for some covariance matrix $\Sigma$. Then we can use Bayes' rule to calculate the posterior over the weights given data; this turns out to be Gaussian:

$$p(w \mid X, y) = N\left(\frac{A^{-1}Xy}{\sigma^2},\ A^{-1}\right)$$

where $A = \frac{XX^T}{\sigma^2} + \Sigma^{-1}$. We can then calculate the predictive distribution at a test point $x_*$, which is again Gaussian:

$$p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, w)\, p(w \mid X, y)\, dw = N\left(\frac{x_*^T A^{-1} X y}{\sigma^2},\ x_*^T A^{-1} x_*\right).$$
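As a sanity check, here's a minimal NumPy sketch of the posterior and predictive formulas above. The noise variance `sigma2`, the prior covariance `Sigma`, and the synthetic data are assumptions chosen purely for illustration; the columns of `X` are the data points, matching the convention in $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 2, 20                       # feature dimension, number of data points
sigma2 = 0.1                       # assumed noise variance sigma^2
Sigma = np.eye(d)                  # assumed prior covariance on the weights

X = rng.normal(size=(d, n))        # columns are data points
w_true = np.array([1.5, -0.7])     # "true" weights for the synthetic data
y = X.T @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior over weights: p(w | X, y) = N(A^{-1} X y / sigma^2, A^{-1})
A = X @ X.T / sigma2 + np.linalg.inv(Sigma)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma2

# Predictive distribution at a test point x_*
x_star = np.array([0.5, 2.0])
f_mean = x_star @ w_mean           # x_*^T A^{-1} X y / sigma^2
f_var = x_star @ A_inv @ x_star    # x_*^T A^{-1} x_*

print(w_mean, f_mean, f_var)
```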

Suppose we want to map our data into a higher-dimensional feature space, using features $\phi(x)$ in place of $x$. For brevity, we'll write $\phi_*$ instead of $\phi(x_*)$ and $\Phi$ instead of $\phi(X)$, the matrix whose columns are the feature vectors of the training points:

$$p(f_* \mid x_*, X, y) = N\left(\frac{\phi_*^T A^{-1} \Phi y}{\sigma^2},\ \phi_*^T A^{-1} \phi_*\right)$$

where now $A = \frac{\Phi\Phi^T}{\sigma^2} + \Sigma^{-1}$. Define $K = \Phi^T \Sigma \Phi$. It is then possible to show that the mean and variance given above are equal to

$$p(f_* \mid x_*, X, y) = N\left(\phi_*^T \Sigma \Phi (K + \sigma^2 I)^{-1} y,\ \phi_*^T \Sigma \phi_* - \phi_*^T \Sigma \Phi (K + \sigma^2 I)^{-1} \Phi^T \Sigma \phi_*\right)$$

(For the mean it's simple algebra: from $\frac{1}{\sigma^2}\Phi(K + \sigma^2 I) = \frac{1}{\sigma^2}\Phi\Phi^T\Sigma\Phi + \Phi = \left(\frac{\Phi\Phi^T}{\sigma^2} + \Sigma^{-1}\right)\Sigma\Phi = A\Sigma\Phi$, it follows that $\frac{1}{\sigma^2}A^{-1}\Phi y = \Sigma\Phi(K + \sigma^2 I)^{-1}y$; the covariance requires the matrix inversion lemma.) Now let's define $k(x, x') = \phi(x)^T \Sigma \phi(x')$; note that the predictive distribution can be written so that all of the data enters only through $k$, which we call the covariance function. It turns out that this gives exactly the same predictive distribution as the function-space view of a Gaussian process.
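To make the equivalence concrete, here's a small numerical check under an assumed polynomial feature map $\phi(x) = (1, x, x^2)^T$ (the map and all constants are illustrative, not canonical): the weight-space and kernel forms of the predictive mean and variance should agree to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # assumed feature map; any phi works, with k(x, x') = phi(x)^T Sigma phi(x')
    return np.stack([np.ones_like(x), x, x**2])

sigma2 = 0.05                  # assumed noise variance
Sigma = np.eye(3)              # assumed prior covariance on the weights

x = rng.uniform(-1, 1, size=15)
y = np.sin(2 * x) + rng.normal(scale=np.sqrt(sigma2), size=15)

Phi = phi(x)                   # 3 x n matrix of training features
x_star = np.array([0.3])
phi_star = phi(x_star)[:, 0]   # feature vector of the test point

# Weight-space form: invert the (feature-dim x feature-dim) matrix A
A = Phi @ Phi.T / sigma2 + np.linalg.inv(Sigma)
A_inv = np.linalg.inv(A)
mean_w = phi_star @ A_inv @ Phi @ y / sigma2
var_w = phi_star @ A_inv @ phi_star

# Kernel form: all data passes through K = Phi^T Sigma Phi (an n x n matrix)
K = Phi.T @ Sigma @ Phi
G = np.linalg.inv(K + sigma2 * np.eye(len(x)))
mean_k = phi_star @ Sigma @ Phi @ G @ y
var_k = (phi_star @ Sigma @ phi_star
         - phi_star @ Sigma @ Phi @ G @ Phi.T @ Sigma @ phi_star)

assert np.allclose(mean_w, mean_k) and np.allclose(var_w, var_k)
```

Note the trade-off the kernel form buys: the weight-space form inverts a matrix whose size is the feature dimension, while the kernel form inverts an $n \times n$ matrix and never touches the feature space explicitly, which is what makes infinite-dimensional feature maps usable.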