Stochastic Conditioning: Nonlinear Function
Created: May 09, 2021
Modified: May 09, 2021

Stochastic Conditioning

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • I just saw a paper by Tom Rainforth et al. on stochastic conditioning, and I want to try to articulate what one could mean by that and what I think they do mean by it. In standard probabilistic modeling, or probabilistic programming, we describe a generative process that induces a joint distribution over a set of random variables, and then, conditioning on the values of some of those variables, we can infer a posterior distribution over the unobserved variables. But what if we don't actually observe a value? Maybe there's some complicated process, say a football game, with a generative process by which a coin is selected, then flipped to determine who kicks off, with various downstream events depending on that. Ordinarily we might observe the actual value of the flipped coin, or it might be a latent variable. But imagine a setting where somebody tells us that the coin that was flipped was biased, with probability of heads 0.9 (of course biased coins don't really exist, but let's pretend). In that setting, the conditional distribution on the remaining random variables is, I guess we'd claim, well defined, but how do we define it and what does it mean? In my setup I sort of implied a process by which first we choose a distribution for the coin and then sample the actual value of the flip, so maybe this is just equivalent to observing one slightly-higher-level parameter, namely the bias of the coin. If that's already an explicit quantity in your model, then maybe we don't need special machinery. But what if we instead "observed" that the marginal distribution of the coin flip under the entire probabilistic program is some distribution? Ultimately, we usually observe an event: we learn that we're in some subset of possible worlds. It's not obvious what it means to be in a subset of possible worlds where a variable follows some probability distribution, because in every possible world the variable has an actual value. I suppose we could identify a subset in which some fraction of the possible worlds has the variable at one value and some fraction has it at another value, and we aren't told exactly which possible world we're in, only that we're in one of those worlds. OK, but that can't be it, because it doesn't pick out a unique conditioning event. If I wanted to condition on the fact that a coin has a 50% probability of landing heads, there are many subsets of possible worlds I could pick in which half the members land heads and the other half land tails, and I'm not even sure it's well defined that there is a most general such subset. So I don't think that's the conditioning they're proposing.
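A minimal sketch of the "no special machinery needed" case above (my own toy example, not from the paper): when the coin's bias is an explicit latent variable, "observing that the coin is biased with P(heads) = 0.9" is just an ordinary observation of that variable.

```python
# Toy model: a discrete prior over the coin's bias, then a flip.
# Conditioning on the bias itself is a standard Bayesian observation.

prior = {0.5: 0.6, 0.9: 0.4}  # P(bias): made-up numbers

# Standard conditioning: observe that the bias is 0.9.
posterior_bias = {b: (1.0 if b == 0.9 else 0.0) for b in prior}

# Downstream query: P(heads) after conditioning on the bias.
p_heads = sum(posterior_bias[b] * b for b in prior)
print(p_heads)  # 0.9 -- no special machinery needed
```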
  • No: the first example they give is a Beta-Bernoulli model, where instead of actually observing a sequence of coin flips, we observe that out of a long sequence of flips, the average rate of heads was some value p. Taken literally this doesn't make sense: if we actually observed the distribution of heads, we would just know the value of the Bernoulli parameter and there would be no uncertainty. But their claim is that observing something distributed according to p, n times, should be the same as observing n actual flips that average out to p. So it's weaker than observing a single concrete coin flip, and it's not observing the actual distribution of the coin, because that would destroy all uncertainty. It seems like, in general, it's as if you drew many IID realizations of a random variable, observed the distribution of those realizations (or some summary statistic), but then updated only as if there were one such random variable, so you don't do the full update from all of the realizations. So stochastic conditioning is a statement about your evidence, not a statement about the generative model. It's kind of like saying: we didn't observe this variable directly, but we observed some downstream evidence that gives us some belief, e.g. some noisy measurement. Instead of actually observing the coin, somebody looked at it from far away with a blurry telescope, and based on their observation they believe there's a 75% chance that the coin is heads. That has nothing to do with whatever generative process produced the coin; it's just an aspect of the measurement process. But in that setting you just have a likelihood, so then I don't understand what their formal contribution is.
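A toy check of my reading of their Beta-Bernoulli example (function names and numbers are mine): observing that n flips averaged out to p_bar should update a Beta(a, b) prior the same way as observing those n flips individually.

```python
def update_from_average(a, b, n, p_bar):
    """Conjugate Beta(a, b) update from an observed heads rate over n flips."""
    return a + n * p_bar, b + n * (1 - p_bar)

def update_from_flips(a, b, flips):
    """Standard flip-by-flip conjugate Beta update."""
    for f in flips:
        a, b = a + f, b + (1 - f)
    return a, b

# 10 flips averaging 0.5 should match 5 heads and 5 tails seen one at a time:
assert update_from_average(1, 1, 10, 0.5) == update_from_flips(1, 1, [1]*5 + [0]*5)
```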
  • Looking at their definition: the likelihood of a distribution is obtained by averaging the log-likelihoods of the individual outcomes over that distribution. If your distribution is a delta function, this of course just gives you back the original likelihood. For a Bernoulli probability, averaging log-likelihoods amounts to a geometric average of the actual likelihoods, and that's kind of what the Bernoulli likelihood already is: the likelihood of one outcome is written as p^x * (1 - p)^(1 - x).
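The definition as I read it, in a toy Bernoulli version (names mine): the "likelihood" of an observed distribution D under parameter p is the geometric mean of the pointwise likelihoods, exp(E_{y~D}[log p(y)]).

```python
import math

def bernoulli_lik(p, y):
    return p if y == 1 else 1 - p

def stochastic_lik(p, dist):
    """dist maps outcomes to probabilities; returns exp of the average log-lik."""
    return math.exp(sum(w * math.log(bernoulli_lik(p, y)) for y, w in dist.items()))

p = 0.3
# A delta distribution on y=1 recovers the ordinary likelihood p:
assert math.isclose(stochastic_lik(p, {1: 1.0}), p)
# A mixed distribution gives p^0.7 * (1-p)^0.3 -- the familiar Bernoulli
# form p^x (1-p)^(1-x) with a fractional "observation":
assert math.isclose(stochastic_lik(p, {1: 0.7, 0: 0.3}), p**0.7 * (1 - p)**0.3)
```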
  • I always thought of that as a mathematical trick, but they're kind of generalizing it. You could ask: why not just take the arithmetic average of the likelihoods? Apparently that's been done before, and they compare the interpretations. I don't totally understand what they mean, but apparently their version has a nice connection to KL divergence: the most likely latent variable x given an observed distribution over y is the x such that the conditional distribution of y given x has the closest KL divergence to the observed distribution. Now I wonder if this actually extends to proportionality. They're not characterizing a posterior, just a most likely point, but if it did extend they would probably have said so.
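A numeric check of the KL claim in the discrete case (my own toy, not the paper's notation): since E_{y~D}[log q(y)] = -H(D) - KL(D || q), maximizing the averaged log-likelihood over candidate conditionals q is the same as minimizing KL(D || q).

```python
import math

D = {0: 0.25, 1: 0.75}  # observed distribution over y
# Candidate conditionals q(y | x), indexed by a Bernoulli parameter x:
family = {x: {0: 1 - x, 1: x} for x in (0.1, 0.5, 0.75, 0.9)}

def avg_loglik(q):
    """Averaged log-likelihood E_{y~D}[log q(y)]."""
    return sum(D[y] * math.log(q[y]) for y in D)

def kl(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p)

best_by_lik = max(family, key=lambda x: avg_loglik(family[x]))
best_by_kl = min(family, key=lambda x: kl(D, family[x]))
assert best_by_lik == best_by_kl == 0.75  # both pick q = D
```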
  • So as I understand it, they're not explicitly defining a true update operation, i.e., a well-defined operation in probability theory in the basic terms of sigma-algebras and possible worlds. They're defining an update operation that doesn't itself come directly from core probability theory, and they're analyzing its properties.
  • OK, so put differently: instead of using the log-likelihood at a particular observation of y, they average the log-likelihoods over multiple observations of y. That's not a crazy thing to do, and it has this nice connection to KL divergence, and … but their proof shows that it is
  • OK, the task that makes the most sense to me so far: there are 804 municipalities in New York and we want to estimate the total population. What we know is that someone sampled 100 of those municipalities at random (without replacement, probably) and gave us summary statistics of the distribution of those 100 populations: the mean, the standard deviation, the median, the 75th and 25th percentiles, the minimum, the maximum, and a few others. Fine. Now somebody else did the same thing, or maybe the same person ran the process twice, and gave us another set of statistics. This is a potentially realistic setting: the original analysis required access to the whole dataset, but the paper only reports the statistics, and maybe we want to try to reconstruct the underlying data. It's a little weird because the statistics were computed twice; if it were just once, then the distribution you get would be your best estimate of the actual distribution. But we'll take it. So I guess in this case there must be a prior distribution over the populations of each of these municipalities, and then in their world, instead of observing a single sample from that distribution (a single municipality), we observe these summary statistics. And there is an actual sampling process we could write as a probabilistic program: first one draws 804 populations; there is a real data-generating process.
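A sketch of that data-generating process as I understand the task (all numbers and names illustrative, not the paper's actual setup): 804 municipality populations, a sample of 100 without replacement, and only summary statistics of the sample reported, twice.

```python
import random
import statistics

random.seed(0)
# Illustrative prior: log-normal populations (parameters made up).
populations = [round(random.lognormvariate(mu=8.0, sigma=1.5)) for _ in range(804)]

def report(pops, k=100):
    """Sample k municipalities without replacement; report only statistics."""
    sample = random.sample(pops, k)
    qs = statistics.quantiles(sample, n=4)  # 25th, 50th, 75th percentiles
    return {"mean": statistics.mean(sample),
            "stdev": statistics.stdev(sample),
            "q25": qs[0], "median": qs[1], "q75": qs[2],
            "min": min(sample), "max": max(sample)}

# Two independent reports from the same underlying data:
report_a, report_b = report(populations), report(populations)
```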
  • Their model is: unknown parameters are the mean and variance of the log-populations of the cities. They put a prior on those. Now let's imagine two settings:
    • We treat the actual populations of 100 cities as latent variables, but we observe summary statistics of that distribution. (well-defined standard Bayesian update).
    • We define the log-likelihood of a (mean, variance) pair as the average log-likelihood of any given municipality size under that pair, averaged over the piecewise distribution from the given quantiles.
      • RED FLAG: why is piecewise interpolation correct? It seems like there's a degree of freedom here not present in the well-defined model.
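To make the red flag concrete, here is a sketch (illustrative, not the paper's code) of how the second setting might score a (mean, variance) pair: the reported quantiles define a piecewise distribution over log-populations, and the interpolation scheme within each interval is exactly the extra degree of freedom flagged above.

```python
import math

def normal_logpdf(y, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def avg_loglik(mean, var, quantile_points, probs):
    """quantile_points: sorted breakpoints; probs: mass in each interval.
    Averages the Normal log-density at interval midpoints; the choice of
    midpoints (vs. any other interpolation) is the free degree of freedom."""
    total = 0.0
    for (lo, hi), w in zip(zip(quantile_points, quantile_points[1:]), probs):
        total += w * normal_logpdf((lo + hi) / 2, mean, var)
    return total

# Reported log-population quantiles (min, q25, median, q75, max), made up:
qpts = [5.0, 7.2, 8.0, 8.9, 12.0]
probs = [0.25, 0.25, 0.25, 0.25]
print(avg_loglik(8.0, 2.0, qpts, probs))
```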
  • So what if we do this with three (or two) municipalities instead of 804? Our model is: there are three municipalities with populations sampled from a prior