Bayesian
Created: June 08, 2021
Modified: April 08, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The Bayesian approach to statistics is to 'just use probability theory'. You write down a joint probability distribution over observed and unobserved quantities (a probability model), and then calculate conditional distributions over the unobserved quantities given the observed quantities.
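
As a concrete toy sketch of this recipe (my own example; the model choice is an assumption, not from the notes' sources), consider a Beta-Bernoulli coin model, where conditioning the joint on observed flips gives a closed-form posterior:

```python
import numpy as np
from scipy import stats

# Probability model: theta ~ Beta(a, b) (prior); flips ~ Bernoulli(theta), iid (likelihood).
# The joint is p(theta, flips) = p(theta) p(flips | theta); conditioning on the observed
# flips gives the posterior over theta, which by conjugacy is again a Beta.
a, b = 2.0, 2.0                       # prior hyperparameters (chosen for illustration)
flips = np.array([1, 0, 1, 1, 0, 1])  # observed data: 4 heads, 2 tails

heads = flips.sum()
tails = len(flips) - heads
posterior = stats.beta(a + heads, b + tails)  # Beta(a + #heads, b + #tails)

print("posterior mean:", posterior.mean())               # 0.6
print("95% credible interval:", posterior.interval(0.95))
```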

Once we agree on a probability model, the Bayesian approach is uncontroversial: it's just math. The philosophical objection is that all models are wrong. A joint probability distribution includes both a prior and a likelihood; both of these are subjective choices which are, in the purest sense, not informed by the data (one can do data-driven model selection, but one must still choose a prior on models).

We might quibble with privileging probability theory as the modeling framework. Some arguments in favor, from 'What Keeps a Bayesian Awake At Night? Part 1: Day Time' (Cambridge MLG Blog, mlg-blog.com):

  • de Finetti's theorem: if you believe that the order of your data is unimportant (i.e., that they are exchangeable), then there exists a latent parameter $\theta$ with some prior $p(\theta)$ such that your data are iid conditioned on the parameter. If you're okay assuming a distribution over the data, then this justifies the use of prior distributions on parameters. (The representation is written out after this list.)
  • Cox's theorem: under certain axioms of coherence, probability theory is the unique extension of propositional logic that allows for varying degrees of plausibility. For example, we should reach the same result by updating on one datapoint at a time as by updating on the entire dataset at once.
  • The Dutch book argument: if you are willing to take a bet on any proposition at some odds, then an adversary can guarantee to pump away all your money if your odds do not reflect a coherent joint probability distribution on the propositions.
  • Savage's representation theorem: an agent whose preferences over actions satisfy certain rationality axioms behaves as if maximizing expected utility under some subjective probability distribution --- if you accept the axioms, decision theory forces probabilities on you.
  • Doob's consistency theorem: Bayesian inference is consistent. If the data really were sampled from $p(D \mid X)$ for some true value of $X$, then the posterior $p(X \mid D)$ will concentrate on that value with probability 1, assuming that:
    • The model is identifiable, i.e., $p(\cdot \mid X_1) \ne p(\cdot \mid X_2)$ for all $X_1 \ne X_2$.
    • The true $X$ was sampled from the prior $p(X)$. (That is, there can be a set of measure zero under the prior where consistency does not hold --- this is apparently a problem in nonparametric Bayesian inference, where people have to prove stronger results.)
    • The data $D = \{D_1, \ldots, D_n\}$ are iid.
    • The proof (linked above) is actually quite technical and involves martingales. (A toy simulation of this concentration is sketched after this list.)
  • Optimality of predictions: among data-dependent estimates of the density, the expected KL divergence from the true density $p(X \mid \theta)$ is minimized by the Bayesian predictive distribution $p(X \mid D) = \int p(X \mid \hat\theta)\, p(\hat\theta \mid D)\, d\hat\theta$.
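
For reference, the representation guaranteed by de Finetti's theorem (a standard statement, written out here because the first bullet above leans on it):

```latex
% For an infinitely exchangeable sequence X_1, X_2, \ldots, there exists a latent
% parameter \theta with prior p(\theta) such that every finite marginal factors as
p(x_1, \ldots, x_n) = \int \left[ \prod_{i=1}^{n} p(x_i \mid \theta) \right] p(\theta) \, d\theta .
```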
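
And a minimal simulation of the concentration that Doob's theorem describes, reusing the Beta-Bernoulli model from the sketch above (the true parameter is drawn from the prior, as the theorem requires):

```python
import numpy as np

rng = np.random.default_rng(0)

# Doob's conditions: identifiable model, true parameter drawn from the prior, iid data.
a, b = 2.0, 2.0
theta_true = rng.beta(a, b)

for n in [10, 100, 1000, 10000]:
    flips = rng.random(n) < theta_true          # n iid Bernoulli(theta_true) draws
    heads = int(flips.sum())
    post_a, post_b = a + heads, b + (n - heads)
    post_mean = post_a / (post_a + post_b)
    post_sd = np.sqrt(post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1)))
    print(f"n={n:6d}  posterior mean={post_mean:.4f}  sd={post_sd:.4f}  true={theta_true:.4f}")
```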

There is a long-standing divide between 'Bayesian' and 'frequentist' statisticians. But there is no conflict:

  • frequentist statistics is an approach to analyzing estimators.
  • Bayesian statistics is an approach to designing estimators.

One can ask about the frequentist properties of Bayesian estimators: are they consistent? Are they calibrated? A frequentist analysis answers the question: is this a method that will work well across a range of datasets? Bayesian inference answers the question: assuming a particular model, what should I believe given this data?
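
A sketch of such a frequentist check under the toy Beta-Bernoulli model above (assumptions mine): when the data-generating parameter really is drawn from the model's prior, the 90% credible interval should cover the truth about 90% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a, b, n, trials = 2.0, 2.0, 50, 2000

covered = 0
for _ in range(trials):
    theta = rng.beta(a, b)                 # truth drawn from the model's own prior
    heads = rng.binomial(n, theta)         # one simulated dataset
    lo, hi = stats.beta(a + heads, b + n - heads).interval(0.90)
    covered += (lo <= theta <= hi)

print("coverage of the 90% credible interval:", covered / trials)  # should be ~0.90
```

If the truth were instead drawn from some other distribution than the model's prior, coverage could drift away from 90%; that gap is exactly the kind of property a frequentist analysis probes.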

These imply different 'philosophies' of probability: a frequentist treats probabilities as a representation of long-run frequencies, while a Bayesian treats probabilities as a representation of subjective degrees of belief. But probability theory is just math; it's just a model. It is interesting and useful that the same math can be used to describe long-run frequencies and to describe subjective degrees of belief, but perhaps not surprising: when the long-run frequency is known (as in a fair coin), it's pretty obvious that this should determine your degree of belief. The Bayesian innovation is to notice that the arguments above justify extending the use of probability theory to subjective situations where long-run frequencies are not well defined.

Should science use Bayesian techniques?

  • The question, "what process can we use to reliably identify valid scientific conclusions?" is a frequentist question (process is frequentist). We'd like to show that our process will yield correct results with high probability.
  • The question, "what should we believe about this hypothesis, given the data?" is a Bayesian question. It would be nice to have methods that provide a route to answering this question.
  • From a statistical point of view, a natural compromise is to report the likelihood of the data under various hypotheses. This does not assume a prior, but it allows a meta-analyzer to aggregate evidence from multiple papers, combining it with their own prior to derive beliefs. (A toy version of this is sketched after this list.)
  • The standard approach instead reports a p-value: the probability, under the null hypothesis, of data at least as extreme as what was observed. This is problematic partly because the null hypothesis is always wrong, and partly because it answers the uninteresting question of 'is it likely that there is any effect at all?' rather than the more relevant question of 'is this effect worth caring about?'.
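
The likelihood-reporting compromise can be made concrete (a toy construction of my own, not a standard protocol): each paper publishes its log-likelihood over a grid of hypothesized effect sizes; since the datasets are independent, a meta-analyzer sums the curves and adds their own log-prior.

```python
import numpy as np

# Grid of hypothesized effect sizes.
effects = np.linspace(-1.0, 2.0, 301)
step = effects[1] - effects[0]

# Each paper reports log p(data | effect) on the grid. Here we fake two papers whose
# estimates are Gaussian around the hypothesized effect (an assumption of this sketch).
def paper_loglik(observed_mean, se):
    return -0.5 * ((observed_mean - effects) / se) ** 2

loglik = paper_loglik(0.4, 0.2) + paper_loglik(0.7, 0.3)  # independent data => sum

log_prior = -0.5 * effects ** 2          # the meta-analyzer's own (broad) prior
log_post = loglik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * step                # normalize on the grid

print("posterior mean effect:", np.sum(effects * post) * step)
```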

Beyond the philosophy, there are practical questions about our ability to 'be Bayesian'. In complex models we will need approximate Bayesian inference techniques, and past a certain level of complexity these become suboptimal for decision-making, because the cost of computation is itself part of the decision problem. The more general flaw with Bayesian (and many frequentist) methods is that all models are wrong. A nice argument from nostalgebraist:

I think “ideal Bayes always works” is only a useful statement when you’re talking about really ideal Bayes, where your sample space includes every computable hypothesis.  Outside of that context, which never shows up IRL, we can’t even say that Bayes would be best even if we had perfectly formulated all the information we have into a prior.  If it’s not a prior over all computable hypotheses, it’s still just a truncation, and it might be a badly-behaved truncation.

In practice we can't do inference with a Solomonoff prior, and every other Bayesian method is such a truncation: subject to misspecification, and liable to its own pathologies.
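
A small illustration of the pathology (a toy of my own, not from the quoted post): fit a single Gaussian to data drawn from a well-separated two-Gaussian mixture. The posterior over the mean concentrates confidently near zero, where the true density is nearly zero; more data makes the posterior tighter, not less wrong.

```python
import numpy as np

rng = np.random.default_rng(2)

# True data: 50/50 mixture of N(-3, 1) and N(+3, 1). Model: single N(mu, 1), flat prior on mu.
n = 10000
data = np.where(rng.random(n) < 0.5, rng.normal(-3, 1, n), rng.normal(3, 1, n))

# With a flat prior and known unit variance, the posterior over mu is N(mean(data), 1/n).
post_mean, post_sd = data.mean(), 1 / np.sqrt(n)
print(f"posterior over mu: N({post_mean:.3f}, {post_sd:.4f}^2)")
print("fraction of data within 1 of the posterior mean:",
      np.mean(np.abs(data - post_mean) < 1.0))   # tiny: the model is confident and wrong
```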