STS structure learning: Nonlinear Function
Created: March 06, 2020
Modified: March 07, 2020

STS structure learning

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • Very simple brainstorming

  • Say we want to search over STS architectures. An STS architecture here is really just a list of components, with their params. I could think about them as having coefficients for how they're added together, but this is probably just equivalent to setting all the coefs to 1 but changing their scale params. (though I should think about the effect on the prior).
  • There should be a Bayesian Occam's Razor effect: all else equal, we prefer a model with fewer components to one with more components. If we had a birth/death scheme, this would be encoded in the marginal likelihood: as we add more components (each with coef 1), they give probability mass to an increasingly large set of signals, and thus less mass to any given signal. Note this is true even for components with fixed hparams. (A toy numerical illustration is sketched at the end of these notes.)
  • Then of course there's an explicit complexity penalty. If our model is that whenever we add a component, we roll a die to decide which type it is, then each component comes with a log N complexity penalty (i.e. a factor of 1/N in the prior). (though then we have to correct for symmetries I guess, so the overall penalty for K components is only log(N choose K); a quick sketch of this counting is at the end of these notes). We'd also pay the density of choosing each new component's params from their priors. (another way to think about this is that 'type' is just another component hparam that can take on N values).
  • In practice if we had hparams for each process that we imperfectly marginalized out, we'd also be paying for the ELBO gap in each component that we used. Of course this isn't true Bayesian model selection---it's more of an 'implicit regularization' effect we'd expect to be at play in the VI case, which might be good or bad.
  • Suppose we want to approximate this scheme by just fixing a set of components and optimizing over gating coefficients. We'd have:
    • We'd get the same Bayesian Occam's Razor from just optimizing over scale coefs on each component: increasing the scale increases the set of signals it can represent.
    • We could get an explicit complexity penalty by somehow working out a continuous regularization term based on the scales. It would need to specialize to log(N choose K) if K of the N scales are 1 and the rest are zero. I might have even already done this in the symmetries work somewhere. Then for model parameters we just have their densities, so that's easy.
      • Ignoring the symmetries for now: why not just weight the prior log-densities of params for each component by its scale contribution? This does the right thing for the discrete 'type' param and also for the continuous prior densities.
    • (Maybe somehow we arrange to weight the KL terms in the ELBO by something like the coefficient of the corresponding component. Again, this would imitate variational implicit regularization, not true Bayesian model selection.)
    • Now for any set of hparams and scale coefs (which might end up being redundant w/ the scale hparams), we have a model that could have been generated by our generative process. Our score for it is just the prior density of all the hparams, each weighted by the relevant coef, plus the priors on the coefs themselves. (A sketch of this score is at the end of these notes.)
      • The priors on the coefs maybe should be horseshoe-like to accommodate sparsity?
    • Including the weighting, this isn't obviously a direct density of any particular Bayesian meta-model (but there must be some interpretation in which we marginalize out the latents entirely). So what does it mean to do, say, VI in this model? What are reasonable variational posteriors?
    • We'll expect a posterior dependence between all the scale params and the observation noise param (which is really a scale param of a noise process), so by default this is in 'non-centered' form. We might prefer to write a 'centered' form where we actually separately parameterize the total scale and the portion of that scale contributed by each component. (A sketch of this reparameterization is at the end of these notes.)
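
A toy numerical illustration of the Occam's Razor effect above, under the simplifying assumption that each unit-scale component contributes iid N(0, 1) noise at every timestep. This is a stand-in for a real STS marginal likelihood, just to show the direction of the effect:

```python
# Toy Bayesian Occam's razor: more unit-scale components spread the prior
# predictive over a broader set of signals, so a signal generated by a single
# component gets less marginal mass under the larger models.
# Assumption: each component contributes iid N(0, 1) at every timestep, so a
# model with K components has marginal predictive y_t ~ N(0, K + sigma_noise^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma_noise = 0.1
T = 200

# Signal actually generated by a single unit-scale component (plus noise).
y = rng.normal(0.0, np.sqrt(1.0 + sigma_noise**2), size=T)

for K in range(1, 6):
    marginal_scale = np.sqrt(K + sigma_noise**2)
    log_marginal = stats.norm(0.0, marginal_scale).logpdf(y).sum()
    print(f"K={K} components: log marginal likelihood = {log_marginal:.1f}")
# Expect K=1 to score highest: larger models pay for all the signals they
# could have represented but didn't.
```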
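A quick sketch of the counting behind the explicit complexity penalty (assuming the K chosen types are distinct; the values of N and K are made up):

```python
# Complexity penalty for picking K of N component types by rolling a die per component.
from math import comb, factorial, log

N = 12  # number of available component types (illustrative)
K = 3   # number of components in the candidate architecture

naive_penalty = K * log(N)                            # K independent die rolls
symmetry_corrected = K * log(N) - log(factorial(K))   # divide out the K! orderings
unordered_penalty = log(comb(N, K))                   # log (N choose K)

print(naive_penalty, symmetry_corrected, unordered_penalty)
# symmetry_corrected = log(N^K / K!), which is ~= log(N choose K) when N >> K.
```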
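A minimal sketch of the relaxed score from the bullets above: each component's hparam log-prior (including its discrete 'type' term) gets weighted by its gating coefficient, and the coefficients themselves get a sparsity-inducing half-Cauchy prior as a stand-in for a horseshoe. All of the names and distribution choices here are assumptions for illustration, not a worked-out model:

```python
# Relaxed architecture score: gating coefficients weight each component's
# hparam log-prior, and the coefficients themselves get a horseshoe-like
# (here: half-Cauchy) sparsity prior. Illustrative only.
import numpy as np
from scipy import stats


def architecture_score(coefs, hparam_logpriors, type_logprior):
    """coefs: per-component gating scales (0 turns a component off).
    hparam_logpriors: per-component log prior density of that component's hparams.
    type_logprior: log prior of picking a component's type (e.g. -log N)."""
    coefs = np.asarray(coefs, dtype=float)
    logps = np.asarray(hparam_logpriors, dtype=float) + type_logprior
    # Weight each component's "cost" by how much it participates in the model.
    weighted_hparam_term = np.sum(coefs * logps)
    # Sparsity-inducing prior on the gating coefficients themselves.
    coef_prior_term = stats.halfcauchy(scale=1.0).logpdf(coefs).sum()
    return weighted_hparam_term + coef_prior_term


# Example: 3 candidate components out of N = 12 types, only the first two "on".
print(architecture_score(coefs=[1.0, 0.7, 0.0],
                         hparam_logpriors=[-2.3, -1.1, -4.0],
                         type_logprior=-np.log(12)))
```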
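And a sketch of the centered reparameterization from the last bullet: instead of independent per-component scales plus an observation-noise scale (non-centered), parameterize a single total scale plus the fraction of total variance contributed by each component. The exact decomposition (variance proportions on a simplex) is my assumption here:

```python
# Non-centered: independent per-component scales sigma_k plus a noise scale.
# Centered: one total scale plus a simplex of variance proportions.
import numpy as np

def to_centered(component_scales, noise_scale):
    variances = np.concatenate([np.square(component_scales), [noise_scale**2]])
    total_var = variances.sum()
    proportions = variances / total_var            # lives on the simplex
    return np.sqrt(total_var), proportions

def to_noncentered(total_scale, proportions):
    variances = proportions * total_scale**2
    scales = np.sqrt(variances)
    return scales[:-1], scales[-1]                 # per-component scales, noise scale

total_scale, props = to_centered(np.array([1.0, 0.5, 0.0]), noise_scale=0.1)
print(total_scale, props)
print(to_noncentered(total_scale, props))          # round-trips to the original scales
```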