bias-variance tradeoff: Nonlinear Function
Created: May 14, 2022
Modified: September 28, 2023

bias-variance tradeoff

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

I think of "variance" as the error in a statistical estimate that comes from not having enough data (with an identifiable model, the variance would go to zero as the amount of data goes to infinity), and "bias" as the remaining "irreducible" error that comes from having the wrong hypothesis class: one that either doesn't contain the true value, or one that concentrates prior mass on non-true values.
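For squared-error loss this intuition matches the textbook decomposition (a standard result, written here from memory rather than taken from anything above): with data y = f(x) + noise and an estimator fit to a random dataset D,

```latex
% Classic squared-error decomposition at a fixed input x,
% with y = f(x) + \epsilon, E[\epsilon] = 0, Var(\epsilon) = \sigma^2,
% and \hat{f}_D the estimator fit to a random dataset D.
\[
\mathbb{E}_{D,\epsilon}\!\left[(y - \hat{f}_D(x))^2\right]
  = \underbrace{\left(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
\]
```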

The intuitive tradeoff here is between weak and strong priors, or equivalently between small and large hypothesis classes. A strong prior adds bias but reduces variance; a weak prior reduces bias but increases variance. In the same way, a small hypothesis class introduces bias but reduces variance, while a large hypothesis class decreases bias and increases variance.
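A minimal simulation sketch of that tradeoff, under a toy setup I'm making up here (sinusoidal target, polynomial hypothesis classes of increasing degree; all function names and parameters below are illustrative, not from the notes above):

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)          # the "true" nonlinear function

def simulate(degree, n_train=30, n_datasets=200, noise=0.3):
    """Estimate bias^2 and variance over a grid of test points for a
    polynomial hypothesis class of the given degree."""
    x_test = np.linspace(0, 1, 100)
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = target(x) + rng.normal(0, noise, n_train)
        coefs = np.polyfit(x, y, degree)   # best fit within the class
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - target(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

# Small classes (low degree): high bias, low variance.
# Large classes (high degree): low bias, high variance.
for d in (1, 3, 9, 15):
    b2, v = simulate(d)
    print(f"degree {d:2d}: bias^2 = {b2:.3f}, variance = {v:.3f}")
```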


Ben Recht argues that "the bias-variance tradeoff is a useless tautology and not a tradeoff at all":

You take some prediction function F that you fit to some data. Let’s say that the best function you could have created if you had infinite data but the same code was G. And let’s say the best function you could have created if you had infinite data and infinite computation was H. Then

error[F] = (error[F]-error[G]) + (error[G]-error[H]) + error[H]

error[F]-error[G] is called the variance. error[G]-error[H] is called the bias. QED. That’s it, friends. That is the “bias-variance tradeoff.” There is nothing profound about it.  Do I have to isolate these terms when I do machine learning? Of course not. I will never know what that “H” function is, and it doesn’t matter.

The idea, I guess, is that with infinite computation we could search an arbitrarily large hypothesis class that would contain the true hypothesis, and with infinite data we would be able to isolate that hypothesis.
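A toy numerical reading of that decomposition, again under made-up assumptions (cubic polynomials as the "same code", a sine target, squared error; F is fit on little data, G on a very large sample from the same class, and H is taken to be the Bayes-optimal predictor, whose error is just the noise level):

```python
import numpy as np

rng = np.random.default_rng(1)
noise = 0.3

def target(x):
    return np.sin(2 * np.pi * x)

def test_error(predict, n_test=100_000):
    """Expected squared error of a predictor on fresh noisy data."""
    x = rng.uniform(0, 1, n_test)
    y = target(x) + rng.normal(0, noise, n_test)
    return np.mean((y - predict(x)) ** 2)

def fit_poly(degree, n):
    """Fit a degree-`degree` polynomial to n noisy samples."""
    x = rng.uniform(0, 1, n)
    y = target(x) + rng.normal(0, noise, n)
    coefs = np.polyfit(x, y, degree)
    return lambda x: np.polyval(coefs, x)

degree = 3                                    # the "same code": cubic polynomials
err_F = test_error(fit_poly(degree, 30))      # F: limited data, limited class
err_G = test_error(fit_poly(degree, 200_000)) # G: ~infinite data, same class
err_H = noise ** 2                            # H: Bayes error (true function known)

print(f"variance-ish term  error[F]-error[G] = {err_F - err_G:.4f}")
print(f"bias-ish term      error[G]-error[H] = {err_G - err_H:.4f}")
print(f"irreducible        error[H]          = {err_H:.4f}")
```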