Created: January 27, 2022
Modified: February 15, 2022
flexible model family
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
- As AGW points out here, it is statistically better to fit a flexible model family, with an inductive bias, than a constrained model family: the latter is just a degenerate case of the former. Bayesian inference works, eventually, if the true model is within the hypothesis class; otherwise, with infinite data, the posterior concentrates on an incorrect model. So it is good if our class is flexible enough to contain the true model.
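- A minimal sketch of that concentration behavior (a toy coin-flip example of my own, not from AGW's post): both families concentrate with enough data, but only the one containing the true model concentrates on the truth; the misspecified one settles on its closest member.

  ```python
  # Data come from Bernoulli(0.7). A family that excludes p=0.7 still
  # concentrates, but on the wrong (KL-closest) hypothesis; a flexible
  # family whose grid includes 0.7 concentrates on the truth.
  import numpy as np

  rng = np.random.default_rng(0)
  data = rng.random(5000) < 0.7  # true model: Bernoulli(p=0.7)

  def posterior(ps, data):
      # uniform prior over the discrete hypotheses in `ps`
      ps = np.asarray(ps, dtype=float)
      k, n = data.sum(), len(data)
      log_lik = k * np.log(ps) + (n - k) * np.log1p(-ps)
      return np.exp(log_lik - np.logaddexp.reduce(log_lik))

  constrained = np.array([0.3, 0.5])      # excludes the true model
  flexible = np.linspace(0.05, 0.95, 19)  # grid that includes 0.7

  print(posterior(constrained, data).round(3))           # ~[0, 1]: all mass on 0.5
  print(flexible[np.argmax(posterior(flexible, data))])  # ~0.7
  ```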
- Of course all models are wrong, but really all models are incomplete: there can still be a 'true' stochastic model at the available level of abstraction. If we model your survival from cancer as a coin-flip, then of course this is not the real process, but if the only data available are the binary facts of who did and didn't survive, then there is some abstraction of the true model into a single coin-flip, and our hypothesis class will contain that abstraction (along with many other hypotheses that are not abstractions of the true model).
- So: we want our model families to be flexible enough to represent the true model, abstracted to the variables that we've observed. But to get generalization, we still need to give them an inductive bias.
- Deep networks are flexible classes of functions, though it is not always obvious what inductive bias a particular weight prior encodes.
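- One way to poke at that inductive bias (a sketch of my own, not a claim about any particular method): sample weights from an isotropic Gaussian prior and look at the functions the network induces; the bias lives in this function-space prior, not in the weight-space prior itself.

  ```python
  # Draw functions from a Gaussian prior over MLP weights; the distribution
  # of these curves is the (not-always-obvious) inductive bias of the prior.
  import numpy as np

  def sample_mlp_function(x, widths=(1, 50, 50, 1), weight_std=1.0, rng=None):
      rng = rng or np.random.default_rng()
      h = x[:, None]
      layers = list(zip(widths[:-1], widths[1:]))
      for i, (d_in, d_out) in enumerate(layers):
          W = rng.normal(0.0, weight_std / np.sqrt(d_in), size=(d_in, d_out))
          b = rng.normal(0.0, weight_std, size=d_out)
          h = h @ W + b
          if i < len(layers) - 1:
              h = np.tanh(h)  # nonlinearity on hidden layers only
      return h[:, 0]

  x = np.linspace(-3, 3, 200)
  prior_draws = np.stack([sample_mlp_function(x) for _ in range(10)])
  print(prior_draws.shape)  # (10, 200): ten random functions on the grid
  ```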
- The most flexible hypothesis class is all computer programs that halt. Since that's uncomputable, we add a runtime constraint: all computer programs that halt before taking N steps. Assuming the Church-Turing thesis, and that the process we're modeling is computed by the universe in real time, this implies that the (abstracted) true model is in our hypothesis class.
- (Technically, we would need to allow quantum programs; for the time being we focus on problems where quantum effects are not relevant.)
- Note that this abstracted true model will be a probabilistic program, since abstraction introduces ambiguity.
- David Duvenaud's example: if you get hit by a bus at age 20, a deterministic model could only explain this by saying it was fated to happen when you were born, and you'd been carrying that fate around as hidden state ever since. That makes no sense. In particular, once we posit hidden state, the obvious thing to do is to marginalize it out using our subjective uncertainty over its value. And now we're back to having a stochastic model.
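- In code, that marginalization step is almost trivially short (a made-up two-line example, just to pin down the point): the latent 'fate' bit determines the outcome exactly, but averaging over our prior on it gives back an ordinary coin-flip model.

  ```python
  # A "deterministic" model with hidden state: the fate bit f fixes the outcome,
  # so P(outcome | f) is 0 or 1. Marginalizing f under a subjective prior gives
  # a plain Bernoulli predictive -- a stochastic model again.
  p_fate = 0.3                                   # prior probability that f = 1
  p_outcome = 1.0 * p_fate + 0.0 * (1 - p_fate)  # sum over f of P(outcome | f) P(f)
  print(p_outcome)                               # 0.3, same as a coin-flip model
  ```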
- So, if we could do inference over general (bounded-time) probabilistic programs, we would solve the fundamental issue of model misspecification! Given enough data, the posterior would eventually converge on the true (abstracted) model.
- I guess this is really part of the promise of nonparametric Bayes: a space of probabilistic models that, in principle, includes everything.
- In the sense that deep networks are differentiable circuits, they do give us the ability to do inference over programs. Their architectures may be lacking, but we are rapidly inventing differentiable programming primitives that allow us to build out more interesting classes of programs.
- We could imagine building out the technology to optimize differentiably over the size and architecture of the circuit---and then we can run HMC or fit ensembles or whatever in this continuous space to sample from a 'posterior' over circuits.
- However, the fundamental issue is: as long as these are deterministic programs, they generally will not contain the true abstracted model. At minimum this means we need stochastic computation graphs (SCGs)---circuits that include sampling operations---in our 'universal' model class.
- We define a probabilistic Turing machine or circuit as a deterministic program with access to an infinite tape of pre-flipped bits. By definition, this can compute any computable distribution.
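- A small sketch of this 'deterministic program plus random tape' picture (my own toy example): a deterministic function that consumes fair bits from the tape and thereby samples from a Geometric(1/2) distribution.

  ```python
  # Every stochastic primitive is a deterministic function of pre-flipped coins.
  # `geometric` reads bits off the tape until the first 1; counting the flips
  # is exactly a Geometric(1/2) sampler.
  import itertools
  import numpy as np

  def random_tape(seed=0):
      rng = np.random.default_rng(seed)
      while True:
          yield int(rng.integers(0, 2))  # stand-in for an infinite tape of fair bits

  def geometric(tape):
      # deterministic given the tape: flips up to and including the first 1
      return 1 + sum(1 for _ in itertools.takewhile(lambda b: b == 0, tape))

  tape = random_tape()
  print([geometric(tape) for _ in range(5)])  # five Geometric(1/2) draws from the tape
  ```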
- Directions for probabilistic programming: is there a continuous parameterization of a generic SCG that can represent any 'reasonable' stochastic program?
- In some sense this should be something like a hierarchical VAE, with multiple layers of stochastic latents?
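- A minimal sketch of what that generative side might look like (an assumed architecture, nothing settled): a stack of Gaussian latents, each parameterized by a small network applied to the layer above; a hierarchical VAE would pair this ancestral sampler with an inference network.

  ```python
  # Generative half of a hierarchical latent-variable model: a stochastic
  # computation graph with several layers of latents, sampled top-down.
  import numpy as np

  rng = np.random.default_rng(0)
  dims = [2, 4, 8, 16]  # latent dims from the top layer down to the observation

  def init_mlp(d_in, d_hidden, d_out):
      return (rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in), np.zeros(d_hidden),
              rng.normal(size=(d_hidden, d_out)) / np.sqrt(d_hidden), np.zeros(d_out))

  def mlp(params, h):
      W1, b1, W2, b2 = params
      return np.tanh(h @ W1 + b1) @ W2 + b2

  # one decoder per layer, emitting a mean and log-stddev for the layer below
  decoders = [init_mlp(d_in, 32, 2 * d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]

  def sample():
      z = rng.normal(size=dims[0])  # top-level latent: standard normal
      for params, d_out in zip(decoders, dims[1:]):
          out = mlp(params, z)
          mu, log_sigma = out[:d_out], out[d_out:]
          z = mu + np.exp(log_sigma) * rng.normal(size=d_out)  # next layer down
      return z  # the bottom layer plays the role of the observed data

  print(sample().shape)  # (16,)
  ```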