Gaussian process: Nonlinear Function
Created: February 09, 2014
Modified: March 16, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Weight-Space View

Recall standard linear regression. We suppose $f(x) = x^T w$ and $y = f(x) + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$, and where $x$ can be augmented with an implicit 1 term to allow a bias to be learned. This gives us a likelihood of observing any particular set of data points, given weights. If we're Bayesian we'll also need to put a prior on the weights; in particular let's take $w \sim N(0, \Sigma)$ for some covariance matrix $\Sigma$. Then we can use Bayes' rule to calculate the posterior over the weights given data; this turns out to be Gaussian:

$$p(w \mid X, y) = N\left(\frac{A^{-1}Xy}{\sigma^2},\ A^{-1}\right)$$

where $A = \frac{XX^T}{\sigma^2} + \Sigma^{-1}$. We can then calculate the predictive distribution at a test point $x_*$, which is again Gaussian:

$$p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, w)\, p(w \mid X, y)\, dw = N\left(\frac{x_*^T A^{-1} X y}{\sigma^2},\ x_*^T A^{-1} x_*\right).$$
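As a sanity check, here's a minimal NumPy sketch of the posterior and predictive formulas above. The noise variance `sigma2`, the prior covariance `Sigma`, and the synthetic data are assumptions chosen purely for illustration; the columns of `X` are the data points, matching the convention in $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 2, 20                       # feature dimension, number of data points
sigma2 = 0.1                       # assumed noise variance sigma^2
Sigma = np.eye(d)                  # assumed prior covariance on the weights

X = rng.normal(size=(d, n))        # columns are data points
w_true = np.array([1.5, -0.7])     # "true" weights for the synthetic data
y = X.T @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior over weights: p(w | X, y) = N(A^{-1} X y / sigma^2, A^{-1})
A = X @ X.T / sigma2 + np.linalg.inv(Sigma)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma2

# Predictive distribution at a test point x_*
x_star = np.array([0.5, 2.0])
f_mean = x_star @ w_mean           # x_*^T A^{-1} X y / sigma^2
f_var = x_star @ A_inv @ x_star    # x_*^T A^{-1} x_*

print(w_mean, f_mean, f_var)
```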

Suppose we want to map our data into a higher-dimensional feature space, using features $\phi(x)$ in place of $x$. For brevity, we'll write $\phi_*$ instead of $\phi(x_*)$ and $\Phi$ instead of $\phi(X)$, the matrix whose columns are the feature vectors of the training points:

$$p(f_* \mid x_*, X, y) = N\left(\frac{\phi_*^T A^{-1} \Phi y}{\sigma^2},\ \phi_*^T A^{-1} \phi_*\right)$$

where now $A = \frac{\Phi\Phi^T}{\sigma^2} + \Sigma^{-1}$. Define $K = \Phi^T \Sigma \Phi$. It is then possible to show that the mean and variance given above are equal to

$$p(f_* \mid x_*, X, y) = N\left(\phi_*^T \Sigma \Phi (K + \sigma^2 I)^{-1} y,\ \phi_*^T \Sigma \phi_* - \phi_*^T \Sigma \Phi (K + \sigma^2 I)^{-1} \Phi^T \Sigma \phi_*\right)$$

(For the mean it's simple algebra: from $\frac{1}{\sigma^2}\Phi(K + \sigma^2 I) = \frac{1}{\sigma^2}\Phi\Phi^T\Sigma\Phi + \Phi = \left(\frac{\Phi\Phi^T}{\sigma^2} + \Sigma^{-1}\right)\Sigma\Phi = A\Sigma\Phi$, it follows that $\frac{1}{\sigma^2}A^{-1}\Phi y = \Sigma\Phi(K + \sigma^2 I)^{-1}y$; the covariance requires the matrix inversion lemma.) Now let's define $k(x, x') = \phi(x)^T \Sigma \phi(x')$; note that the predictive distribution can be written so that all of the data enters only through $k$, which we call the covariance function. It turns out that this gives exactly the same predictive distribution as the function-space view of a Gaussian process.
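To make the equivalence concrete, here's a small numerical check under an assumed polynomial feature map $\phi(x) = (1, x, x^2)^T$ (the map and all constants are illustrative, not canonical): the weight-space and kernel forms of the predictive mean and variance should agree to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # assumed feature map; any phi works, with k(x, x') = phi(x)^T Sigma phi(x')
    return np.stack([np.ones_like(x), x, x**2])

sigma2 = 0.05                  # assumed noise variance
Sigma = np.eye(3)              # assumed prior covariance on the weights

x = rng.uniform(-1, 1, size=15)
y = np.sin(2 * x) + rng.normal(scale=np.sqrt(sigma2), size=15)

Phi = phi(x)                   # 3 x n matrix of training features
x_star = np.array([0.3])
phi_star = phi(x_star)[:, 0]   # feature vector of the test point

# Weight-space form: invert the (feature-dim x feature-dim) matrix A
A = Phi @ Phi.T / sigma2 + np.linalg.inv(Sigma)
A_inv = np.linalg.inv(A)
mean_w = phi_star @ A_inv @ Phi @ y / sigma2
var_w = phi_star @ A_inv @ phi_star

# Kernel form: all data passes through K = Phi^T Sigma Phi (an n x n matrix)
K = Phi.T @ Sigma @ Phi
G = np.linalg.inv(K + sigma2 * np.eye(len(x)))
mean_k = phi_star @ Sigma @ Phi @ G @ y
var_k = (phi_star @ Sigma @ phi_star
         - phi_star @ Sigma @ Phi @ G @ Phi.T @ Sigma @ phi_star)

assert np.allclose(mean_w, mean_k) and np.allclose(var_w, var_k)
```

Note the trade-off the kernel form buys: the weight-space form inverts a matrix whose size is the feature dimension, while the kernel form inverts an $n \times n$ matrix and never touches the feature space explicitly, which is what makes infinite-dimensional feature maps usable.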