continuous structure learning: Nonlinear Function
Created: March 06, 2020
Modified: March 07, 2020

continuous structure learning

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • Relevant papers:
  • A basic case: STS structure learning (automated statistician style).
  • Thinking about big stuff: really, all of differentiable programming is in this category. Once we have a differentiable representation of a class of programs, we can optimize over that class, but we can also run HMC or VI over it (toy sketch after these notes).
  • A very different toy case might be the geometric distribution. We have the probabilistic program def geometric(p): if sample(Bernoulli(p)): return 1 + geometric(p) else: return 0, whose execution samples a geometric random variable but along the way creates (and sums) a bunch of discrete Bernoulli draws (runnable version after these notes).
    • This is actually not a very 'probabilistic' object, since conditioned on the output we know exactly that all of the latents were 1 except for the last, which was 0. But let's say we're also uncertain about p.
    • Okay, well, again in this circumstance there's no structural uncertainty, because we know what all of the choice points were, so it just reduces to standard Beta-Bernoulli conjugacy (spelled out in a small sketch after these notes). But humor me for a minute and try to think about a continuous representation that could recover these choice points.
      • So for each choice point i, imagine we have a posterior variational Bernoulli with some probability q_i.
        • Of course there are really infinitely many potential choice points in principle. Let's upper-bound them with some stack limit, say 128. (If you needed to go more levels deep than that, you shouldn't have written an explicit recursion lol.)
        • So one could imagine we 'pre-flipped' 128 coins to determine the structure of the program. That gives 2**128 possible coin settings, though in practice only ~128 distinct structures are reachable (the structure is determined by the position of the first tails), so this is wildly inefficient: even if we searched the entire computation tree we would only discover those ~128.
        • One could imagine a mixture posterior over all ~128 structures: a variational weight on each structure. Then once we've chosen the structure, I guess in principle we could have variational weights on other stuff, but in this case there is no other stuff.
          • How do we get a variational weight on each structure? Again we're back at the notion of a guide program. As we run the model we have a variational Bernoulli on each choice point. The tricky thing here might be getting marginal weights when different choices lead to (roughly) the same structure (a sketch after these notes computes these marginal weights for the geometric example).
          • Okay, so imagine we can 'build the graph' for all program structures up to some limit. Now the variable we're observing (the 'return value') exists in all structures, but in this case it has a different value in each structure, so it uniquely picks out a structure.
    • Ugh okay thinking about this is not nothing but it's a little too weird to be useful. Is there a nicer example?
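
A toy sketch of the "differentiable representation of a class of programs" idea from the STS / differentiable-programming bullets above. This is entirely my own illustration, not taken from any of the papers: relaxed_sts_loglik, the three candidate components, and the finite-difference loop are all made up for the example. The point is just that the discrete "which components are in the model?" choice becomes a continuous inclusion logit, so we can do gradient ascent (or, in principle, HMC / VI) over structure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relaxed_sts_loglik(logits, y, sigma=0.5):
    """Gaussian log-likelihood of y under a weighted sum of candidate
    components, with soft inclusion weights sigmoid(logits)."""
    t = np.arange(len(y))
    components = np.stack([
        np.ones_like(t, dtype=float),       # constant level
        t / len(y),                         # linear trend
        np.sin(2 * np.pi * t / 12.0),       # seasonal, period 12
    ])
    weights = sigmoid(logits)               # soft structure choice, in [0, 1]
    mean = weights @ components
    return np.sum(-0.5 * ((y - mean) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma ** 2))

# Fake data: trend plus seasonality plus noise.
rng = np.random.default_rng(0)
t = np.arange(60)
y = t / 60.0 + np.sin(2 * np.pi * t / 12.0) + 0.1 * rng.normal(size=60)

# The structure choice is now a continuous parameter, so we can do gradient
# ascent on it (crude finite differences here; in practice autodiff), or in
# principle run HMC or VI over it instead.
logits = np.zeros(3)
eps, lr = 1e-4, 0.01
for _ in range(200):
    grad = np.array([
        (relaxed_sts_loglik(logits + eps * np.eye(3)[i], y)
         - relaxed_sts_loglik(logits - eps * np.eye(3)[i], y)) / (2 * eps)
        for i in range(3)
    ])
    logits += lr * grad

print("soft inclusion weights:", sigmoid(logits))  # trend & seasonal should dominate
```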
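
A runnable version of the geometric program from the bullet above. The note's sample(Bernoulli(p)) is a PPL-style primitive; here sample_bernoulli is just a stand-in using NumPy's RNG.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bernoulli(p):
    """Stand-in for the PPL-style sample(Bernoulli(p)) in the note."""
    return rng.random() < p

def geometric(p):
    # Each call flips one Bernoulli(p) coin; the return value is the number
    # of heads before the first tails, i.e. a geometric random variable.
    if sample_bernoulli(p):
        return 1 + geometric(p)
    else:
        return 0

draws = [geometric(0.7) for _ in range(10000)]
print(np.mean(draws))  # should be close to p / (1 - p) = 7/3 for p = 0.7
```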
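
The Beta-Bernoulli conjugacy mentioned above, spelled out for this example, assuming a Beta(a, b) prior on p (posterior_on_p is just my name for the update). Conditioned on geometric(p) returning k, the trace is fully determined: k Bernoulli draws came up 1 and the last came up 0, so the posterior on p is Beta(a + k, b + 1).

```python
def posterior_on_p(a, b, k):
    """Beta posterior parameters on p after observing geometric(p) == k:
    the trace contains k ones and a single zero."""
    return a + k, b + 1

a_post, b_post = posterior_on_p(a=1.0, b=1.0, k=4)  # uniform prior, observed return value 4
print(a_post, b_post)                               # Beta(5, 2); posterior mean 5/7
```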
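
A sketch of the marginal-weight question from the guide-program bullet, for the truncated "pre-flipped 128 coins" representation (structure_weights and the exact layout are my own; this is not from any particular system). With independent variational Bernoullis q_1..q_128 on the choice points, structure k means the first k flips were heads and flip k+1 was tails, i.e. the program returns k. Coins after the first tails are never used, so all 2**(127-k) settings of them collapse onto the same structure; marginalizing them out gives the simple product below.

```python
import numpy as np

DEPTH_LIMIT = 128

def structure_weights(q):
    """Marginal variational weight on each reachable structure k = 0..DEPTH_LIMIT.

    q: array of DEPTH_LIMIT per-choice-point Bernoulli probabilities.
    weight[k] = q_1 * ... * q_k * (1 - q_{k+1}) for k < DEPTH_LIMIT,
    plus a final entry for the 'all heads, hit the depth cap' structure.
    """
    q = np.asarray(q, dtype=float)
    cum_heads = np.concatenate([[1.0], np.cumprod(q)])   # prob of k leading heads
    weights = np.empty(DEPTH_LIMIT + 1)
    weights[:DEPTH_LIMIT] = cum_heads[:DEPTH_LIMIT] * (1.0 - q)
    weights[DEPTH_LIMIT] = cum_heads[DEPTH_LIMIT]         # ran into the depth cap
    return weights

w = structure_weights(np.full(DEPTH_LIMIT, 0.7))
print(w[:5], w.sum())  # geometric-shaped weights over structures; sums to 1
```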