Modified: March 16, 2022
mcmc notes
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector. Note: these are personal notes, taken as I was refreshing myself on this material. They're mostly stream of consciousness and probably not very rigorous or useful to anyone else. The main sources are Wikipedia and Andrieu et al., "An Introduction to MCMC for Machine Learning".
In AI, we often represent knowledge about the world in the form of a graphical model, in which we represent the joint distribution over a set of random variables as a product of simpler, local terms involving only a subset of the variables. In a directed graphical model, these are the conditional distributions of each variable given its parents; in an undirected model they are factors associated with cliques of the graph (or more generally, associated with any subsets of the variables). We are interested in inference, which generally means obtaining $p(x_Q \mid x_E)$, where $x_E$ is a subset of variables we have observed as evidence, and $x_Q$ is a different subset constituting a query for which we would like the posterior distribution of values given the observed evidence (sometimes we also want the most likely value, but we'll not focus on that right now).
MCMC finds approximate posterior distributions through sampling. We construct a Markov chain whose stationary distribution is the distribution of interest, and then run the chain to produce samples. The samples we get are not independent, since the chain will not have time to mix between successive samples, but given sufficiently many samples we can get a good estimate of the distribution.
Markov Chains
The state space is the set of all possible joint assignments of values to the random variables in the model, each of which we call a state. A Markov chain is defined by a function $T_t(x' \mid x)$ giving the probability of transitioning from state $x$ to state $x'$ at time $t$ (in the special case of a time-homogeneous chain we omit the index $t$). The function $T$ is called a transition operator or transition kernel. If the state space is finite, this transition operator can be expressed as a left stochastic matrix $T$. Remember that a left stochastic matrix is a matrix in which all the entries are nonnegative and the columns sum to 1. Thus if we have a probability distribution over states expressed in a vector $\pi$, where $\pi_i = p(x = i)$, then $T\pi$ gives the distribution after one step of the Markov chain.
There are some properties which a Markov chain (or its states) can have:
- (ir)reducibility. We say that a state $x'$ is accessible from $x$ if there is some number of steps $n$ such that we have nonzero probability of transitioning to state $x'$ after taking $n$ steps from $x$. A chain is irreducible if every state is accessible from every other state. If we drew the state space as a graph, with edges labeled by transition probabilities and omitted for transitions with zero probability, then the graph of an irreducible chain has a single connected component. (Note however that there are also graphs with a single connected component corresponding to reducible chains, since accessibility is a directed relationship. We say that two states communicate if they are mutually accessible to each other; then a chain is irreducible iff all states communicate.)
- periodicity. A state is aperiodic if there is some $n_0$ such that, for all $n \ge n_0$, there is a path of length $n$ having nonzero probability that returns to that same state. By contrast, a periodic state is one which can only be returned to at times which are integer multiples of some period $k > 1$. For example, the states in a bipartite graph are periodic with period 2. A chain is aperiodic iff all of its states are aperiodic. Note that if one state in an irreducible chain is aperiodic, then all of them are, since we can always return to that one state and spend an arbitrary amount of time dicking around.
- Recurrence. A state is transient if, starting in that state, there is nonzero probability that we will never return to that state. A state is recurrent if it is not transient. The mean recurrence time is the expected time to return to the state. Note that a recurrent state does not necessarily have a finite mean recurrence time; for example, if the probability of first returning after exactly $n$ steps decays like $1/n^2$, then the mean recurrence time $\sum_n n \cdot P(\text{return at } n) \sim \sum_n 1/n$ diverges. A state is positive recurrent if its mean recurrence time is finite. Positive recurrence is important because it means the state is visited infinitely often with a nonzero long-run frequency (the reciprocal of the mean recurrence time), which is what allows a stationary distribution to exist.
- Ergodicity. A state is ergodic if it is aperiodic and positive recurrent. If all states in an irreducible Markov chain are ergodic, then we say that the chain itself is ergodic.
A distribution $\pi$ over states is called a stationary distribution if it is a fixed point of the transition operator, i.e., $T\pi = \pi$. An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent; in this case the stationary distribution probabilities are just the reciprocals of the mean recurrence times, normalized to sum to 1 (and thus the stationary distribution is unique). Note that even periodic chains can have stationary distributions, e.g., an "alternating" chain with two states that each transition to the other with probability 1 has stationary distribution $(1/2, 1/2)$. However, only an aperiodic chain is guaranteed to *converge* to its stationary distribution, i.e., to have $\lim_{t \to \infty} T^t \pi_0 = \pi$ for any initial distribution $\pi_0$.
If the chain is not irreducible, it may have several stationary distributions: one for each closed communicating class (set of states which communicate with each other), and then also all convex combinations of those.
Even a chain that is not time-homogeneous can have a stationary distribution. This is the case if we have multiple transition operators that each respect the same shared stationary distribution. Note that given transition operators $T_1$ and $T_2$ each having stationary distribution $\pi$, we can define the mixture kernel $\alpha T_1 + (1 - \alpha) T_2$ (corresponding to applying $T_1$ at each step with probability $\alpha$ and $T_2$ otherwise) and the cycle kernel $T_2 T_1$ (corresponding to alternating between the two operators in sequence), and both of these will also have stationary distribution $\pi$.
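As a concrete illustration (mine, not from the notes or the cited sources), here is a minimal numpy sketch for a finite chain: the transition operator is a column-stochastic matrix `T`, one step of the chain maps a distribution vector `pi` to `T @ pi`, and iterating converges to the stationary distribution, which we can check against the eigenvector of `T` with eigenvalue 1. The specific matrix is made up for the example.

```python
import numpy as np

# Made-up 3-state chain; column i gives the distribution over next states from state i.
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])
assert np.allclose(T.sum(axis=0), 1.0)     # left (column-) stochastic

pi = np.array([1.0, 0.0, 0.0])             # start with all mass on state 0
for _ in range(1000):
    pi = T @ pi                            # one step of the chain: pi <- T pi
print("distribution after many steps:", pi)

# The stationary distribution is the eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print("normalized eigenvector:       ", v / v.sum())
```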
A further property is reversibility or detailed balance. This is the condition that there exists some distribution $\pi$ over states such that, for all $x, x'$,
$$\pi(x)\, T(x' \mid x) = \pi(x')\, T(x \mid x').$$
It is easy to show (by summing both sides over $x$) that any such distribution must be a stationary distribution. Thus, if we can construct a chain to be in detailed balance with respect to some distribution $p$, then the chain must have $p$ as a stationary distribution. This is usually how MCMC constructions work.
Summary. A stationary distribution exists as long as the chain is positive recurrent. If it is irreducible, the stationary distribution is unique. If it is aperiodic, the chain will mix to the stationary distribution. Ergodicity summarizes all of these to give a condition for "nice" Markov chains. We can also show the existence of a stationary distribution through detailed balance; in this case, ergodicity guarantees that the stationary distribution is unique.
Metropolis Hastings
The Metropolis-Hastings (MH) algorithm is the basis of most MCMC algorithms used in practice. Here we introduce a proposal distribution $q(x^* \mid x)$, from which we sample a proposal $x^*$ for a new state given the current state $x$, and our transition kernel chooses to accept or reject the proposed new state according to its relative probability compared to the current state. Specifically, we compute an acceptance probability given by
$$A(x, x^*) = \min\left(1, \frac{p(x^*)\, q(x \mid x^*)}{p(x)\, q(x^* \mid x)}\right)$$
and then accept the proposed state with probability $A(x, x^*)$; otherwise it is rejected. If the proposal is rejected, we remain in the current state until the next time step. Thus the MH transition kernel is given by
$$T(x' \mid x) = q(x' \mid x)\, A(x, x') + \mathbb{1}(x' = x)\, r(x),$$
where $\mathbb{1}(x' = x)$ is an indicator function returning 1 if $x' = x$ and 0 otherwise, and $r(x)$, the probability of a rejected proposal, is given by
$$r(x) = \int q(x^* \mid x)\,\big(1 - A(x, x^*)\big)\, dx^*.$$
It is easy to show that MH is in detailed balance with respect to $p$. Note that when $x' = x$ we have detailed balance as a tautology, so we need only consider the case where $x' \ne x$. Here we find
$$p(x)\, T(x' \mid x) = p(x)\, q(x' \mid x) \min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right) = \min\big(p(x)\, q(x' \mid x),\; p(x')\, q(x \mid x')\big),$$
which is symmetric in $x$ and $x'$ and therefore equal to $p(x')\, T(x \mid x')$, so MH is in detailed balance with respect to $p$, and thus $p$ is a stationary distribution of the MH chain. To show that this stationary distribution is unique, we need to establish irreducibility; here it is sufficient that $q(\cdot \mid x)$ is supported across the entire state space (technically we in fact only need that it is supported wherever $p$ is supported, since this establishes a communicating class that includes the entire support of the stationary distribution). To show that we converge to the stationary distribution, we need the chain to be aperiodic; this is immediately established by the possibility of rejection at each step.
The independent sampler is the special case of MH in which the proposal distribution $q(x^*)$ does not depend on the current state $x$. This is kind of like rejection sampling, but still produces correlated samples since the acceptance probability of the next state is still determined by a comparison to the current state. The Metropolis sampler is the special case in which the proposal distribution is symmetric, i.e., $q(x^* \mid x) = q(x \mid x^*)$. Here the acceptance probability reduces to $\min\big(1, p(x^*)/p(x)\big)$. This can be nice since it eliminates the need to evaluate the proposal density.
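A minimal random-walk Metropolis sketch (my own illustration, not taken from the cited sources): the proposal is a symmetric Gaussian step, so the acceptance probability reduces to $\min(1, p(x^*)/p(x))$, computed in log space. The 1-D standard Gaussian target, the step size, and the helper name `metropolis` are assumptions made just for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    return -0.5 * x**2                 # made-up target: unnormalized log density of N(0, 1)

def metropolis(n_samples, step_size=1.0, x0=0.0):
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + step_size * rng.normal()            # symmetric (Gaussian) proposal
        log_alpha = log_target(x_prop) - log_target(x)   # log of p(x*) / p(x)
        if np.log(rng.uniform()) < log_alpha:            # accept...
            x = x_prop
        samples[i] = x                                   # ...or stay put on rejection
    return samples

s = metropolis(20000)
print("sample mean %.3f, sample std %.3f" % (s.mean(), s.std()))   # ~ 0, ~ 1
```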
Derivation and Optimality
(from Chib and Greenberg, "Understanding the Metropolis-Hastings Algorithm" (1995))
We can actually derive the MH acceptance probability as the optimal choice for maintaining detailed balance with an arbitrary proposal distribution. Recall that detailed balance requires
$$p(x)\, T(x' \mid x) = p(x')\, T(x \mid x'),$$
and suppose now that we have some proposal distribution $q$. If this proposal satisfies detailed balance, then we are done, but in general there will be some $x, x'$ such that
$$p(x)\, q(x' \mid x) > p(x')\, q(x \mid x'),$$
without loss of generality since we can always swap $x$ and $x'$. This inequality means that we move from $x$ to $x'$ "too often" relative to the reverse move. We'd like to reject some of those moves to preserve the balance (there could be strategies to fix this problem other than rejection, but it is an obvious thing to consider), so we introduce acceptance probabilities $\alpha(x, x')$ and $\alpha(x', x)$, which we require to satisfy detailed balance, i.e.,
$$p(x)\, q(x' \mid x)\, \alpha(x, x') = p(x')\, q(x \mid x')\, \alpha(x', x).$$
Solving this for the $\alpha$'s, we get the condition
$$\frac{\alpha(x, x')}{\alpha(x', x)} = \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}.$$
This is a condition on the ratio of the $\alpha$'s, meaning that we retain the freedom to scale them both by any constant. The obvious thing to do is choose the largest possible scaling constant, so that we maximize the $\alpha$'s, i.e., maximize the acceptance rate of the chain. Since the $\alpha$'s are probabilities, the largest possible value is 1, so we take $\alpha(x', x) = 1$ (by our inequality above, we are assuming without loss of generality that this is the larger of the two $\alpha$'s) and therefore
$$\alpha(x, x') = \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)},$$
where the ratio on the right is guaranteed to be less than 1, again by our assumption above. In general, this yields the MH rule
$$\alpha(x, x') = \min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right),$$
in which the $\min$ operator automatically chooses the appropriate direction in which to restrict the acceptance probability. And as we showed, this is optimal within the space of acceptance rules achieving detailed balance, in the sense that it will accept all moves in the underrepresented direction, and then reject in the other direction only with the probability required to satisfy detailed balance.
Random Choice of Proposal
(most of this taken from Matthias Seeger's Notes on Metropolis Hastings)
Suppose we have several proposal distributions $q_j$, indexed by $j$. These may help us explore the space in different ways. Since the stationary distribution of the MH kernel is determined by construction, and doesn't depend on the proposal distribution, we can thus define an MH kernel for each proposal distribution and these kernels will all share the same stationary distribution $p$. Thus we can use them in sequence, or we can randomly sample a different proposal to use at each step, and the corresponding cyclic or mixture kernel will also have stationary distribution $p$. What if not all of the proposals are supported across the entire state space (e.g. if each one only has the potential to modify a subset of the variables)? I believe (without having worked out a formal proof) that everything works out as long as the composition of the proposals (in the cyclic case) or a probabilistic mixture of the proposals (in the mixture case) is supported across the entire space.
The more interesting case comes when we want to choose which proposal to use in a way that depends on the current state $x$. Here we have a distribution $p(j \mid x)$, so we could consider the marginal proposal
$$q(x^* \mid x) = \sum_j p(j \mid x)\, q_j(x^* \mid x).$$
But this may not be easy to compute, so we'd like to be able to just sample $j \sim p(j \mid x)$ at each step and then use the resulting proposal distribution $q_j(x^* \mid x)$. This process gives the same sampling distribution for $x^*$ as the marginal proposal above, but the acceptance ratio is different since we're only evaluating the proposal density for a point estimate of $j$. Unfortunately, this does not necessarily give us detailed balance and so will not necessarily maintain the stationary distribution. In proving detailed balance for standard MH, the relevant part of the transition kernel was $q(x' \mid x)\,\alpha(x, x')$, and the key observation was that the expression $p(x)\, q(x' \mid x)\,\alpha(x, x') = \min\big(p(x)\, q(x' \mid x),\ p(x')\, q(x \mid x')\big)$ is symmetric with respect to swapping $x$ and $x'$. Now, the relevant part of our transition kernel is $\sum_j p(j \mid x)\, q_j(x' \mid x)\,\alpha_j(x, x')$, where we are marginalizing the kernel over $j$. Note that this is different from marginalizing the proposal over $j$, since now we are accepting or rejecting based on the probabilities from the specific proposal we sampled rather than using the marginal proposal. Now detailed balance will require
$$p(x) \sum_j p(j \mid x)\, q_j(x' \mid x)\,\alpha_j(x, x') = p(x') \sum_j p(j \mid x')\, q_j(x \mid x')\,\alpha_j(x', x),$$
which boils down to showing that the expression (appearing inside the sum)
$$p(j \mid x)\, \min\big(p(x)\, q_j(x' \mid x),\ p(x')\, q_j(x \mid x')\big)$$
is symmetric with respect to swapping $x$ and $x'$. The obvious difficulty is the $p(j \mid x)$ factor; we'd really like this expression to be
$$\min\big(p(x)\, p(j \mid x)\, q_j(x' \mid x),\ p(x')\, p(j \mid x')\, q_j(x \mid x')\big),$$
since this would in fact be obviously symmetric. Unfortunately, that wasn't how things worked out, but it does mean we can guarantee detailed balance in the special case where these expressions are equal, i.e. when
$$p(j \mid x) = p(j \mid x').$$
When is this condition satisfied? Well, trivially it holds if $p(j \mid x)$ does not in fact depend on $x$, as in the simple case described above. More generally, it holds if we can think of our state as being factored into several variables, where we are proposing to update a particular variable or variables but choosing $j$ using only the values of other variables that are left unchanged by the proposal.
What to do if the condition is not satisfied? If we can't compute the marginal proposal distribution, another possibility is to incorporate $j$ into the state space directly as an auxiliary variable.
We can also do the RJMCMC thing, which is equivalent to just implicitly including an auxiliary variable (see below). That is, we explicitly symmetrize by including the choice of proposal in the acceptance ratio, by solving for the necessary condition
$$p(x)\, p(j \mid x)\, q_j(x' \mid x)\,\alpha_j(x, x') = p(x')\, p(j' \mid x')\, q_{j'}(x \mid x')\,\alpha_{j'}(x', x),$$
where $j'$ indexes the reverse of move $j$, and optimizing as above to get
$$\alpha_j(x, x') = \min\left(1, \frac{p(x')\, p(j' \mid x')\, q_{j'}(x \mid x')}{p(x)\, p(j \mid x)\, q_j(x' \mid x)}\right).$$
That is, we're free to choose kernels intelligently based on the current state, but we have to account for the probability of making the opposite choice. In the RJMCMC case, this means we're free to choose an appropriate birth proposal based on the current state, but in order to accept we'll have to be willing to choose the corresponding death proposal from the new state we end up in.
To see this as an auxiliary variable method, we treat as an auxiliary variable with a uniform distribution (at least if the domain of is finite). Then it will always cancel out of , and we don't actually need to keep track of it between steps. Now the proposal is really over the joint state, but we ignore the "previous value" and just propose from by the chain rule: , which is just the factor under different notation, thus recovering the RJMCMC acceptance ratio.
WAIT WHAT? In the auxiliary variable construction we'd naively get in the denominator, but in the numerator. Which is not the RJMCMC rule, since we can't identify the two s. We don't track , but we can always materialize it from the target uniform distribution. But if we did this we would not in general get high acceptance probabilities. So I still believe the RJMCMC argument is valid, but I don't think the auxiliary model argument justifies it.
Simulated Annealing
If we only want the highest-probability state, we can run MH with a non-homogeneous kernel, constructed so that the invariant distribution at time $i$ is proportional to $p(x)^{1/T_i}$, where $T_i$ is a cooling schedule with $T_i \to 0$ as $i \to \infty$. Then in the limit, we have a stationary distribution in which all the mass is at the global maxima. This is called simulated annealing. It works well if the proposal distribution is good, so the chain mixes well, and if the cooling schedule is appropriate to the mixing rate of the chain (e.g., we don't want to cool all the way to zero before the chain has even had time to mix).
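A minimal simulated-annealing sketch (my own illustration): the same Metropolis kernel as above, but the acceptance ratio is computed against $p(x)^{1/T_i}$ for a cooling schedule $T_i \to 0$, so late in the run only uphill moves (and tiny downhill ones) are accepted. The bimodal target and the linear cooling schedule are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Made-up bimodal density: a taller mode near +3 and a shorter mode near -3.
    return np.logaddexp(-0.5 * (x - 3.0)**2, np.log(0.3) - 0.5 * (x + 3.0)**2)

x = -3.0                                   # deliberately start near the smaller mode
n_steps = 5000
for i in range(n_steps):
    temp = max(1.0 - i / n_steps, 1e-3)    # cooling schedule T_i, decreasing toward 0
    x_prop = x + rng.normal()              # symmetric proposal
    log_alpha = (log_target(x_prop) - log_target(x)) / temp   # target p(x)^(1/T_i)
    if np.log(rng.uniform()) < log_alpha:
        x = x_prop
print("annealed state (should end near the taller mode at +3):", x)
```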
Gibbs Sampling
A special case of MH is when we have separate proposals for each variable $x_i$, and we choose the proposal distribution to be the conditional distribution of that variable given all the other variables, i.e.
$$q_i(x^* \mid x) = p(x^*_i \mid x_{-i}),$$
where we use $x_{-i}$ to denote all but the $i$th component of $x$. In a graphical model, this reduces to the conditional distribution of the variable given its Markov blanket: in a directed model, its parents, children, and mates (other parents of its children); in an undirected model, just its neighbors.
Given this proposal distribution, we have acceptance probability
$$A(x, x^*) = \min\left(1, \frac{p(x^*)\, p(x_i \mid x^*_{-i})}{p(x)\, p(x^*_i \mid x_{-i})}\right).$$
We can expand out the stationary probabilities using the chain rule to get
$$A(x, x^*) = \min\left(1, \frac{p(x^*_i \mid x^*_{-i})\, p(x^*_{-i})\, p(x_i \mid x^*_{-i})}{p(x_i \mid x_{-i})\, p(x_{-i})\, p(x^*_i \mid x_{-i})}\right).$$
Now the key is to realize that $x^*_{-i} = x_{-i}$; i.e., the current and next state agree on everything except $x_i$. If we make this substitution, writing
$$A(x, x^*) = \min\left(1, \frac{p(x^*_i \mid x_{-i})\, p(x_{-i})\, p(x_i \mid x_{-i})}{p(x_i \mid x_{-i})\, p(x_{-i})\, p(x^*_i \mid x_{-i})}\right),$$
then we see that every term cancels, so our acceptance probability is always 1! Thus a Gibbs move is always accepted.
Since we have lots of individual proposals, one for each variable, we can either randomly sample a different variable to update at each step, creating a mixture kernel, or more typically just choose an ordering in which to update the variables, creating a cyclic kernel. Either one is fine, as discussed above. Note that we don't have to update the variables with equal frequency. We do, however, have to be careful: selecting which variable to sample in a way that depends on the current values will not, in general, preserve the stationary distribution of the chain.
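A minimal Gibbs-sampling sketch (my own illustration): for a zero-mean bivariate Gaussian with correlation $\rho$, each full conditional is itself a 1-D Gaussian, so we can cycle through the two variables, sampling each from its conditional given the other, with every move accepted. The value of $\rho$ is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.9                                  # made-up correlation for the example

x1, x2 = 0.0, 0.0
samples = []
for _ in range(20000):
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.normal()   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal()   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    samples.append((x1, x2))

samples = np.array(samples)
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])   # should be close to rho
```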
Multiple Try Metropolis / Hit-And-Run
Multiple Try
(ref: Jun S Liu; Faming Liang; Wing Hung Wong, "The multiple-try method and local optimization in metropolis sampling", 2000)
Multiple-try allows us to make a bunch of proposals, then choose (approximately) the best one. Let $w(a, b) = p(a)\, q(b \mid a)$. Now starting in a state $x$, we draw $k$ proposals $y_1, \dots, y_k \sim q(\cdot \mid x)$, and compute $w(y_j, x)$ for $j = 1, \dots, k$ (these are the same quantities that appear in the acceptance ratios for the individual proposals). Now sample a $y$ from the $y_j$'s with probability proportional to $w(y_j, x)$. Given this $y$, let $x^*_1, \dots, x^*_{k-1}$ be samples from $q(\cdot \mid y)$, and let $x^*_k = x$. Now accept the move to $y$ with probability
$$\min\left(1, \frac{w(y_1, x) + \cdots + w(y_k, x)}{w(x^*_1, y) + \cdots + w(x^*_k, y)}\right).$$
It turns out (see the ref cited, the proof is not hard) that this satisfies detailed balance.
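A minimal multiple-try Metropolis sketch (my own illustration of the procedure described above, with weight function $w(a, b) = p(a)\, q(b \mid a)$ and a Gaussian random-walk proposal). The 1-D standard Gaussian target, the number of tries, the proposal scale, and the helper names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, k = 2.0, 5                          # proposal scale and number of tries (made up)

def p(x):
    return np.exp(-0.5 * x**2)             # made-up unnormalized target: N(0, 1)

def q_density(a, b):
    return np.exp(-0.5 * ((a - b) / sigma)**2)   # unnormalized density of N(a; b, sigma^2)

def w(a, b):
    return p(a) * q_density(b, a)          # weight w(a, b) = p(a) q(b | a)

def mtm_step(x):
    ys = x + sigma * rng.normal(size=k)                    # k forward proposals from q(. | x)
    wy = np.array([w(y, x) for y in ys])
    y = ys[rng.choice(k, p=wy / wy.sum())]                 # pick one with prob proportional to w
    xs = np.append(y + sigma * rng.normal(size=k - 1), x)  # reference set from q(. | y), plus x itself
    wx = np.array([w(x_ref, y) for x_ref in xs])
    if rng.uniform() < min(1.0, wy.sum() / wx.sum()):      # generalized acceptance ratio
        return y
    return x

x, samples = 0.0, []
for _ in range(5000):
    x = mtm_step(x)
    samples.append(x)
print("mean %.3f, std %.3f" % (np.mean(samples), np.std(samples)))   # ~ 0, ~ 1
```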
Hit-and-Run
Basic idea: select a direction to move in by sampling from the unit ball surrounding the current state. Then
- If possible, sample from the Gibbs distribution along that direction. This is just a Gibbs step in a reparameterized space and, therefore, automatically accepted.
- If it's not obvious how to sample directly from the Gibbs distribution, we can use multiple-try Metropolis: sample a bunch of candidate distances from a uniform, Gaussian, or whatever proposal distribution, and then follow the multiple-try procedure to choose one of them and accept/reject the final proposal (I think this is sometimes called "random ray" Monte Carlo).
Generalized Hit-and-Run
Delayed Rejection
(ref: Antonietta Mira, "On Metropolis-Hastings algorithms with delayed rejection")
Suppose we have two proposals, $q_1$ and $q_2$. We perform a standard MH move with proposal $q_1$, yielding acceptance ratio
$$\alpha_1(x, y_1) = \min\left(1, \frac{p(y_1)\, q_1(x \mid y_1)}{p(x)\, q_1(y_1 \mid x)}\right).$$
If this proposal is rejected, we can delay rejection and make a second proposal from $q_2(y_2 \mid x, y_1)$, where, notably, our second proposal can depend on the value $y_1$ we just rejected. To figure out how to do this, consider the probability of moving from $x$ to $y$,
$$T(y \mid x) = q_1(y \mid x)\,\alpha_1(x, y) + \int q_1(y_1 \mid x)\,\big(1 - \alpha_1(x, y_1)\big)\, q_2(y \mid x, y_1)\,\alpha_2(x, y_1, y)\, dy_1.$$
We want to find $\alpha_2$ such that detailed balance holds, i.e., $p(x)\, T(y \mid x) = p(y)\, T(x \mid y)$. We know by construction of $\alpha_1$ that the first term satisfies detailed balance:
$$p(x)\, q_1(y \mid x)\,\alpha_1(x, y) = p(y)\, q_1(x \mid y)\,\alpha_1(y, x),$$
so it remains to impose detailed balance on the second term:
$$p(x) \int q_1(y_1 \mid x)\,\big(1 - \alpha_1(x, y_1)\big)\, q_2(y \mid x, y_1)\,\alpha_2(x, y_1, y)\, dy_1 = p(y) \int q_1(y_1 \mid y)\,\big(1 - \alpha_1(y, y_1)\big)\, q_2(x \mid y, y_1)\,\alpha_2(y, y_1, x)\, dy_1.$$
It is sufficient (though perhaps not necessary) to require that the condition hold for *all* values of $y_1$, i.e.,
$$p(x)\, q_1(y_1 \mid x)\,\big(1 - \alpha_1(x, y_1)\big)\, q_2(y \mid x, y_1)\,\alpha_2(x, y_1, y) = p(y)\, q_1(y_1 \mid y)\,\big(1 - \alpha_1(y, y_1)\big)\, q_2(x \mid y, y_1)\,\alpha_2(y, y_1, x),$$
which yields the ratio condition
$$\frac{\alpha_2(x, y_1, y)}{\alpha_2(y, y_1, x)} = \frac{p(y)\, q_1(y_1 \mid y)\,\big(1 - \alpha_1(y, y_1)\big)\, q_2(x \mid y, y_1)}{p(x)\, q_1(y_1 \mid x)\,\big(1 - \alpha_1(x, y_1)\big)\, q_2(y \mid x, y_1)}.$$
By the same argument as the MH derivation above, the optimal choice is
$$\alpha_2(x, y_1, y) = \min\left(1, \frac{p(y)\, q_1(y_1 \mid y)\,\big(1 - \alpha_1(y, y_1)\big)\, q_2(x \mid y, y_1)}{p(x)\, q_1(y_1 \mid x)\,\big(1 - \alpha_1(x, y_1)\big)\, q_2(y \mid x, y_1)}\right).$$
Using this as the acceptance ratio, we can try a "second proposal" if the first one fails, at some additional computational cost since we have to compute the second-stage proposal density $q_2$. It's possible to do multiple rounds of delayed rejection, but the complexity rises quickly.
Hamiltonian Monte Carlo (HMC)
Suppose that our target distribution is differentiable and we can compute the gradient of its log density. How can we use this in an MCMC algorithm?
Short answer: see Neal (2011), and http://mlg.eng.cam.ac.uk/tutorials/06/im.pdf. The next few sections are just cribbed from the Neal article.
Hamiltonian dynamics
The Hamiltonian $H(q, p)$ is a function of the position $q$ and momentum $p$ of all particles in the state (in general, if we have $n$ particles in a $d$-dimensional space, then $q$ and $p$ will each be $nd$-dimensional vectors). Given this function $H$, the dynamics of the system are defined by the differential equations
$$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}.$$
In Hamiltonian Monte Carlo, we'll consider Hamiltonian functions of the form
$$H(q, p) = U(q) + K(p),$$
where $U(q)$, a function of the current position, is called the potential energy, and $K(p)$, a function of momentum, is the kinetic energy. Using HMC to sample from an (unnormalized) density $\pi(q)$, we typically define the potential energy
$$U(q) = -\log \pi(q)$$
and kinetic energy
$$K(p) = \frac{1}{2} p^\top M^{-1} p,$$
where $M$ is a PSD "mass matrix", typically diagonal; this functional form corresponds (up to a constant) to the negative log-density of a zero-mean Gaussian with covariance $M$. In this setting, we have
$$\frac{dq}{dt} = M^{-1} p, \qquad \frac{dp}{dt} = -\nabla U(q),$$
which says, intuitively, that the rate of change of position is given by the (scaled) momentum, and the rate of change of momentum is given by the slope of the log density, i.e., the rate at which we are "falling" downhill. Note that by integrating these dynamics for a duration $s$ we obtain a mapping $T_s$ from the state at time $t$ to the state at time $t + s$.
Hamiltonian dynamics have several important properties. Reversibility means that given the state at time $t + s$, we can apply an inverse operator $T_{-s}$ to obtain the state at time $t$; this inverse is obtained by negating the time derivatives, or in our particular setting, by negating the momentum, applying $T_s$, then negating again. Conservation means that the Hamiltonian is invariant over time: $dH/dt = 0$. And volume preservation means that if we apply the operator $T_s$ to some region $R$ of $(q, p)$ space, the image of $R$ under $T_s$ will have the same volume as $R$: this can be shown by analyzing the determinant of the Jacobian of $T_s$. Volume preservation is apparently a consequence of a more general property, symplecticness, that I have no intuition for.
Discretization
Unfortunately Hamiltonian dynamics are defined implicitly as a differential equation; in general the forward operator $T_s$ will not have a nice analytic form. We can, however, simulate the dynamics on a computer using a discrete approximation. The obvious thing to try is Euler's method: given the state at time $t$, compute the state at time $t + \epsilon$ by stepping both variables along their current derivatives with step size $\epsilon$:
$$p(t + \epsilon) = p(t) - \epsilon\, \nabla U(q(t)), \qquad q(t + \epsilon) = q(t) + \epsilon\, M^{-1} p(t).$$
A modified Euler's method, with better numerical properties, is to take a step on $p$, then take a step on $q$ (using the new value of $p$), instead of updating both variables simultaneously:
$$p(t + \epsilon) = p(t) - \epsilon\, \nabla U(q(t)), \qquad q(t + \epsilon) = q(t) + \epsilon\, M^{-1} p(t + \epsilon).$$
This method can be shown to exactly preserve volume. (The update for each variable is a "shear" transformation -- an update to a subset of variables that depends only on a different subset of variables -- which has a Jacobian matrix with determinant one. For example, in the update for $q$, the new $q$ is just the old $q$ plus some stuff that depends on $p$, and the new $p$ is just the old $p$. So the diagonal entries of the Jacobian are ones, and the fact that $p$ doesn't change makes one of the off-diagonal blocks zero, so the determinant is 1.) It therefore avoids the divergence that comes from volumes expanding to infinity or contracting to zero (i.e., spiraling into the origin). However, it is not reversible: if we end up at state $(q', p')$ and try to take a step with the reverse momentum $-p'$, the first thing we do is update the momentum based on the gradient at $q'$, and now our next move on $q$ will follow that new momentum instead of the one that got us from $q$ to $q'$, so we won't end up back at $q$. You might try to fix this by considering the opposite procedure (the one which updates $q$, then $p$), but this has the same problem just pushed back one step: starting at $(q', -p')$ the first reverse move lands correctly, but the momentum update that follows uses a gradient evaluated at a different point than the corresponding forward update did, so again we fail to exactly retrace the forward step.
To preserve reversibility, we make one final tweak to yield the leapfrog method:
$$p(t + \epsilon/2) = p(t) - \frac{\epsilon}{2} \nabla U(q(t)), \qquad q(t + \epsilon) = q(t) + \epsilon\, M^{-1} p(t + \epsilon/2), \qquad p(t + \epsilon) = p(t + \epsilon/2) - \frac{\epsilon}{2} \nabla U(q(t + \epsilon)).$$
These are all still shear transformations, so the leapfrog preserves volume. Even better, the leapfrog is reversible: if we negate $p$ at the end of the trajectory and apply the updates again, we'll end up back where we started (after negating $p$ a final time). This follows just from the symmetry of the update rules. Essentially we're updating the momentum based on the halfway points of our jumps, rather than at the "start" or the "end", and flipping the start and end of the jump doesn't change the halfway point.
It can be shown that the leapfrog introduces single-step error of order $\epsilon^3$, vs. $\epsilon^2$ for the two non-reversible methods.
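A minimal leapfrog sketch (my own illustration, following the update order described above, with an identity mass matrix): a half-step on the momentum, alternating full steps on position and momentum, and a final half-step on the momentum. The quadratic potential used in the reversibility check is just an example.

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Simulate Hamiltonian dynamics for H(q, p) = U(q) + p.p/2 (identity mass matrix)."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)          # initial half-step on the momentum
    for _ in range(n_steps - 1):
        q += eps * p                    # full step on the position
        p -= eps * grad_U(q)            # full step on the momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)          # final half-step on the momentum
    return q, p

# Reversibility check with a made-up quadratic potential U(q) = q.q / 2.
grad_U = lambda q: q
q0, p0 = np.array([1.0, -0.5]), np.array([0.3, 0.8])
q1, p1 = leapfrog(q0, p0, grad_U, eps=0.1, n_steps=25)
q2, p2 = leapfrog(q1, -p1, grad_U, eps=0.1, n_steps=25)    # run back with negated momentum
print(np.allclose(q2, q0), np.allclose(p2, -p0))           # True True
```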
MCMC with Hamiltonian Dynamics
In general, given an energy function $E(x)$, we can define the canonical distribution over states
$$p(x) = \frac{1}{Z} \exp\big(-E(x)/T\big)$$
for some temperature $T$ and normalizing constant $Z$. The Hamiltonian is an energy function, so (taking $T = 1$)
$$p(q, p) = \frac{1}{Z} \exp\big(-H(q, p)\big) = \frac{1}{Z} \exp\big(-U(q)\big)\, \exp\big(-K(p)\big)$$
defines a joint distribution on $q$ and $p$. Since the Hamiltonian is invariant under Hamiltonian dynamics, energy will be conserved, i.e. a perfectly simulated Hamiltonian trajectory will move within a surface of constant probability density in the joint space. Note that this joint density factors into two canonical distributions: the target density $\pi(q) \propto \exp(-U(q))$ and a distribution on the momentum $\propto \exp(-K(p))$. Using the standard quadratic kinetic energy, the distribution on $p$ is a zero-mean Gaussian with covariance $M$, though other choices are possible.
The standard Hamiltonian MCMC algorithm is:
- Resample the momentum from its Gaussian prior.
- Run some number $L$ of leapfrog steps with step size $\epsilon$ to propose a new state $(q^*, p^*)$.
- Accept or reject this proposal with a standard Metropolis-Hastings update. By the reversibility and volume-preserving properties of leapfrog dynamics, the proposal is symmetric, so the acceptance probability reduces to the Metropolis ratio
$$\min\Big(1,\ \exp\big(H(q, p) - H(q^*, p^*)\big)\Big).$$
The first step, resampling the momentum, is valid because the target density of $p$ is just its Gaussian prior, so resampling from the prior preserves stationarity, and $q$ and $p$ are independent under the joint canonical distribution. But why do we need to do this? First, there is the nice property that we don't actually need to store the momentum variables between steps, but this is incidental. The real reason is that the Hamiltonian, thus the joint density, is (approximately) preserved under (approximate) Hamiltonian dynamics, so we need an additional step to allow us to jump to different levels of the joint density. In particular, if we didn't do the resampling, the potential energy could never exceed the initial total energy $H(q_0, p_0)$, so we would never be able to explore regions of low probability density (high potential energy).
The second step, running the Hamiltonian dynamics using the leapfrog integrator, is just an arbitrary deterministic function of the current state: the fact that it is inspired by physics is irrelevant to the structure of the MCMC algorithm. If we were running true Hamiltonian dynamics, then the Hamiltonian and thus the joint density would be preserved by this step, and the MH acceptance probability would always be 1. However, the leapfrog discretization introduces changes in the Hamiltonian, so we need to include the MH step. Note that if we wanted to use a non-reversible set of discrete dynamics (e.g. Euler's method), we would additionally need to compute the forward and backward proposal densities, which is tricky since it involves integrating over all possible momenta we could have resampled.
Note that the momentum resampling would be valid as its own step, but the second component (Hamiltonian dynamics) would not, because the overall symmetry of the proposal depends on the fact that the momentum distribution is symmetric, $K(p) = K(-p)$, i.e. the ability to reverse the trajectory by negating the momentum. The Hamiltonian move by itself is deterministic, and not its own inverse, so the probability of returning to the initial state under it is zero unless we include the momentum resampling.
The previous arguments combine to motivate the algorithm and show that it leaves the canonical distribution invariant. Of course, the null algorithm also leaves the canonical distribution invariant. For HMC to be valuable, it needs to actually mix to the canonical distribution, i.e. the chain needs to be ergodic. This is apparently implied in general by the fact that the momentum distribution is supported everywhere, and arbitrary values of momentum can move us to arbitrary points in the space (not obvious to me why this is true?). But there's a problem in practice if our step size is such that the discrete dynamics exhibits periodic behavior; one solution is to sample the step size randomly from some small interval.
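A minimal HMC sketch putting the pieces together (my own illustration, not Neal's code): resample the momentum from its Gaussian prior, simulate the dynamics with a leapfrog integrator, then accept or reject with probability $\min(1, \exp(H_{\text{old}} - H_{\text{new}}))$. The 2-D correlated Gaussian target, the step size, the trajectory length, and the helper names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
cov = np.array([[1.0, 0.95], [0.95, 1.0]])   # made-up correlated Gaussian target
prec = np.linalg.inv(cov)

log_target = lambda q: -0.5 * q @ prec @ q   # unnormalized log density
grad_U     = lambda q: prec @ q              # gradient of U(q) = -log_target(q)

def leapfrog(q, mom, eps, n_steps):
    q, mom = q.copy(), mom.copy()
    mom -= 0.5 * eps * grad_U(q)
    for _ in range(n_steps - 1):
        q += eps * mom
        mom -= eps * grad_U(q)
    q += eps * mom
    mom -= 0.5 * eps * grad_U(q)
    return q, mom

def hmc_step(q, eps=0.1, n_steps=20):
    mom = rng.normal(size=q.shape)                     # resample momentum from its prior
    H_old = -log_target(q) + 0.5 * mom @ mom
    q_new, mom_new = leapfrog(q, mom, eps, n_steps)    # deterministic proposal
    H_new = -log_target(q_new) + 0.5 * mom_new @ mom_new
    if np.log(rng.uniform()) < H_old - H_new:          # Metropolis accept/reject
        return q_new
    return q

q = np.zeros(2)
samples = []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)
samples = np.array(samples)
print("empirical covariance:\n", np.cov(samples.T))    # should be close to cov
```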
Alternate Dynamics
Since the Hamiltonian dynamics are ultimately just an MH proposal, we can introduce lots of approximations, as long as the proposal remains symmetric and volume-preserving. For example, we could run the leapfrog on an approximate Hamiltonian, e.g. a GP approximation to the true density. Or we can impose parameter constraints by including "barriers" that we bounce off of. Several other possibilities are described below.
Windowed HMC
Plotting the value of $H$ over the course of a leapfrog trajectory, we might find that it oscillates around the original value, and our acceptance probability might be low or high depending on where exactly this oscillation is when the trajectory ends.
Windowed HMC exploits this by operating on sequences of states instead of just single states. For a given state $(q, p)$, we can imagine the infinite sequence of states defined by forward and backward leapfrog steps from it. More concretely, imagine the set of finite length-$W$ substrings of this sequence that contain the current state. Windowed HMC corresponds to mapping from the current state to a uniformly chosen member of that set, performing leapfrog steps to extend the trajectory forwards to have length $L + 1$ (where we'd generally choose $L \gg W$), then proposing the final $W$ states of the resulting trajectory as the "new state" in the higher-dimensional space. This proposal is accepted or rejected by MH using the density induced on these length-$W$ sequences by the mapping we described. The result will be a length-$W$ sequence, chosen from either the end or the beginning of the trajectory depending on whether the move was accepted or rejected. We finally map this down to a single state by sampling from the states in this sequence with probabilities proportional to their densities under the canonical distribution. This overall step can be shown to leave the canonical distribution invariant.
In general, windowed HMC has lower rejection rates because the sequences effectively "smooth" over their components, avoiding the oscillation in the value of $H$. Apparently this can help a lot in practice. An additional nice property is that, even when a move is "rejected" in the high-dimensional space, we end up resampling a new base state from a window surrounding the start state, so the chain still progresses.
As an implementation trick, instead of storing all states from the accept and reject windows, we can sample a single "accept state" and "reject state" in an online fashion as we compute the relevant portions of the trajectories, by keeping track of a current candidate and replacing it with the newly generated point with probability equal to that point's density divided by the running sum of densities at all points generated so far in that window.
NUTS
(ref: http://arxiv.org/abs/1111.4246)
Suppose that the goal of an HMC step is to travel as far as possible from the initial point $q_0$. We can analyze this using the derivative of the squared distance $\|q - q_0\|^2$:
$$\frac{d}{dt}\, \frac{\|q - q_0\|^2}{2} = (q - q_0)^\top \frac{dq}{dt} = (q - q_0)^\top M^{-1} p,$$
which we see is proportional to the dot product of $(q - q_0)$, the vector from our initial position to the current position, with the momentum $p$. If this quantity becomes negative, we are moving back towards the initial position, thus making negative progress. So the naive thing to try would be to run the leapfrog dynamics until this quantity becomes negative. Unfortunately this is not reversible; the NUTS sampler is essentially a trick to construct a reversible move from this naive idea. It seems a bit complicated and involves building trees and such.
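A tiny sketch of the U-turn check itself (my own illustration, not the full NUTS algorithm): the trajectory has started to double back once the vector from the start point to the current point has negative dot product with the current momentum. The helper name is mine.

```python
import numpy as np

def making_progress(q, q0, p):
    """True while the trajectory is still moving away from its start point q0 (unit mass matrix)."""
    return np.dot(q - q0, p) > 0   # sign of the derivative of ||q - q0||^2 / 2

# Example: moving away vs. doubling back.
print(making_progress(np.array([2.0, 0.0]), np.zeros(2), np.array([1.0, 1.0])))   # True
print(making_progress(np.array([2.0, 0.0]), np.zeros(2), np.array([-1.0, 0.5])))  # False
```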
There's a separate issue of choosing the optimal step size $\epsilon$. One approach is Robbins-Monro optimization: whenever a move is rejected, decrease the step size, and increase the step size when a move is accepted, using a decreasing learning rate $\eta_t$ such that $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. The problem is that the decreasing learning rate does most of its adaptation early on, when the optimal step size during the burn-in phase might actually be quite different from the optimal value at convergence. Instead, NUTS uses a dual averaging update that (apparently) avoids this issue. It's possible to separate this from the no-U-turn approach to setting trajectory lengths, i.e., we can run HMC with a fixed trajectory length while using dual averaging to adapt $\epsilon$, or we can run the no-U-turn sampler with a fixed $\epsilon$.
Short-cut trajectories
Instead of adapting the step-size parameter, we could just propose it from a very broad distribution (e.g. log-normal with high variance) and then set up the steps in such a way that the step takes very little time for a bad value of $\epsilon$. This way the parameter will automatically "adapt" (in a computational sense) to an appropriate value for different parts of the state space, without our having to do any explicit adaptation.
One way to do this is through trajectory reversal. We can group our leapfrog steps into blocks of some fixed size, and after each block, test to see if the standard deviation of $H$ values within that block is within some acceptable window (too high means the stepsize is too large, too low means the opposite). If the test fails, we reverse the trajectory, moving back towards the initial state (by reversibility, these steps are all known and don't need to be computed!) and then off in the opposite direction. If another block of steps fails the test, we reverse again, and at this point all steps in the trajectory have been computed (we will just bounce back and forth between the reversal points until we've taken the required number of steps)! So we've escaped from the bad stepsize at greatly reduced computational cost. By contrast, if the stepsize is good, we will likely never reverse and thus will end up in the same place as always.
For this approach to be valid, the test for a block has to give the same result when applied to the reversed block. If this condition holds, we can see that whatever state we end in, if we undertake the reverse trajectory, then any block that caused us to reverse in the forwards trajectory will again cause us to reverse, so again the dynamics are fully reversed and we'll end up in the start state. Since this trick preserves reversibility, it is a valid symmetric Metropolis proposal.
Riemannian Manifold HMCMC
automatically adapt to correlation structure in high dimensions?
http://www.dcs.gla.ac.uk/inference/rmhmc/
http://arxiv.org/abs/1304.1920
Stochastic Gradient HMCMC
http://arxiv.org/pdf/1402.4102.pdf
Affine Invariant MCMC
(ref: Goodman, Weare, "Ensemble Samplers with Affine Invariance", 2010)
Many MCMC algorithms are very sensitive to affine transformations, i.e., they have a much harder time sampling from an anisotropic Gaussian than from an isotropic Gaussian, since the maximum step size is essentially limited by the standard deviation along the most constrained direction. For example, random-walk MH, Gibbs, and HMC all have this drawback (MH and HMC are rotation invariant, so they are not fazed by high correlations between variables, but they are still sensitive to variables of wildly varying scales). By contrast, an affine-invariant sampler is one in which the samples generated are unchanged if an affine transformation is applied to the space. That is, if we define a Markov chain by $X_{t+1} = R(X_t, \xi_t, \pi)$, where $\pi$ is the target density, $\xi_t$ is a string of random bits, and $R$ is a deterministic function defining the dynamics of the chain, then an affine-invariant chain has the property that
$$R(A X_t + b,\ \xi_t,\ \pi_{A,b}) = A\, R(X_t, \xi_t, \pi) + b,$$
where $\pi_{A,b}$ denotes the correspondingly transformed target density,
for any affine transformation $x \mapsto Ax + b$. Goodman and Weare propose a method to achieve this property using an ensemble sampler, i.e. a chain that tracks multiple "walkers" in parallel, so that the stationary distribution of each walker is the target distribution (though the walkers do not have to move independently, allowing the trajectories of individual walkers to be non-Markov, since historical information regarding the trajectory of one walker can be encoded in the current positions of the other walkers). The approach is based on the Nelder-Mead simplex method for deterministic optimization, which is also affine invariant.
The essential component is the "stretch move", which updates one ensemble component $X_k$ conditioned on some other component $X_j$, by the move
$$Y = X_j + Z\,(X_k - X_j),$$
where $Z$ is sampled from some density $g(z)$ satisfying the condition $g(1/z) = z\, g(z)$. One possible candidate is $g(z) \propto 1/\sqrt{z}$ for $z \in [1/a, a]$, and $0$ otherwise, where $a > 1$ is a tuning parameter. Note that this move is affine invariant since it is just an affine combination of $X_j$ and $X_k$. The acceptance ratio is
$$\min\left(1,\ Z^{n-1}\, \frac{\pi(Y)}{\pi(X_k)}\right),$$
where $n$ is the dimension of the space,
which I don't understand how to derive :-(. To run the general algorithm, cycle through the walkers, and for each walker $X_k$, choose a random other walker $X_j$ to apply the stretch move. The updates can also be done in parallel: split the walkers into two partitions, then alternate updating the first partition in parallel conditioned on the second, and vice versa. The easy parallelization of ensemble samplers can be very valuable when drawing lots of samples from an expensive-to-evaluate likelihood, e.g., in large scientific applications.
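A minimal stretch-move sketch (my own illustration; see e.g. the emcee package for a production implementation): each walker $X_k$ is moved along the line through a randomly chosen other walker $X_j$ by a factor $Z$ drawn from $g(z) \propto 1/\sqrt{z}$ on $[1/a, a]$, and accepted with probability $\min(1, Z^{n-1}\, \pi(Y)/\pi(X_k))$. The 2-D Gaussian target with wildly different scales, and all tuning constants, are assumptions chosen just to show the affine-invariance payoff.

```python
import numpy as np

rng = np.random.default_rng(5)
ndim, nwalkers, a = 2, 20, 2.0             # dimension, ensemble size, stretch parameter

def log_p(x):
    return -0.5 * (x[0]**2 / 100.0 + x[1]**2)   # made-up target with wildly different scales

walkers = rng.normal(size=(nwalkers, ndim))
chain = []
for step in range(2000):
    for k in range(nwalkers):
        j = rng.choice([i for i in range(nwalkers) if i != k])    # some other walker
        z = ((a - 1.0) * rng.uniform() + 1.0) ** 2 / a            # Z ~ g(z) on [1/a, a]
        y = walkers[j] + z * (walkers[k] - walkers[j])            # the stretch move
        log_accept = (ndim - 1) * np.log(z) + log_p(y) - log_p(walkers[k])
        if np.log(rng.uniform()) < log_accept:
            walkers[k] = y
    chain.append(walkers.copy())

chain = np.array(chain)[500:]                                     # discard burn-in sweeps
print("marginal stds:", chain.reshape(-1, ndim).std(axis=0))      # roughly [10, 1]
```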
Another affine-invariant move is the "walk move", in which we choose a small subset of walkers (say, of size three) not containing $X_k$, then propose from a Gaussian centered at $X_k$ with covariance given by the empirical covariance of the chosen subset. Empirically, this seems not to work as well; one possible explanation is that a good proposal requires randomly selecting *three* suitable walkers, while a good proposal in the stretch move only requires selecting one such walker.
Parallel Tempering
Rao-Blackwellization
Reversible Jump MCMC
(most of this taken from David Hastie and Peter Green, "Model Choice using Reversible Jump Markov Chain Monte Carlo")
Reversible jump MCMC deals with the problem of sampling from a space with varying number of dimensions. This includes model choice problems, e.g. selecting between polynomial models of different degrees, as well as situations in which we have a single model with a variable-dimension parameter, e.g. number of cluster components.
General view of MH
Suppose we represent our state as a vector $x$. A proposed next state can be given by a deterministic proposal function $(x', u') = h(x, u)$, where $u$ is a vector of random numbers sampled from some known density $g$. Here $x'$ is the new state and $u'$ represents any other information we need to keep around in order for $h$ to be an invertible function. Let $h'$ be the inverse of $h$, and note that a reverse move will have to sample its random numbers $u'$ from some distribution $g'$ (this *could* be the distribution on $u'$ obtained by sampling $u$ from $g$, then passing through $h$ for some particular value of $x$, but in general it doesn't have to be). Let $\alpha(x, x')$ be the probability that the move from $x$ to $x'$ is accepted. Then we can write the detailed balance requirement as
$$\int_{(x, x') \in A \times B} \pi(x)\, g(u)\, \alpha(x, x')\, dx\, du = \int_{(x, x') \in A \times B} \pi(x')\, g'(u')\, \alpha(x', x)\, dx'\, du',$$
where $A$ and $B$ are any two arbitrary regions of the state space. This is the continuous generalization of the detailed balance condition given earlier, where we now require that probabilities of transition between any two regions of the state space must be reversible. We can substitute variables on the right side to arrive at
$$\int_{(x, x') \in A \times B} \pi(x)\, g(u)\, \alpha(x, x')\, dx\, du = \int_{(x, x') \in A \times B} \pi(x')\, g'(u')\, \alpha(x', x) \left|\frac{\partial (x', u')}{\partial (x, u)}\right| dx\, du,$$
where we now take $(x', u') = h(x, u)$, and the term $\left|\partial(x', u')/\partial(x, u)\right|$ is the Jacobian determinant introduced by the change of variables. (Intuitively, you can think of the Jacobian as accounting for things like changes of units: a uniform distribution on distances between 0 and 1 kilometer has constant density 1.0 on that range, while a uniform distribution on 0 to 1000 meters has density 0.001 even though it's exactly the same distribution. So if we're switching from a state representation measured in kilometers to a representation measured in meters, or a higher-dimensional representation, then we have to adjust the densities accordingly.) Note that a sufficient condition for this equality to hold is if we have
$$\pi(x)\, g(u)\, \alpha(x, x') = \pi(x')\, g'(u')\, \alpha(x', x) \left|\frac{\partial (x', u')}{\partial (x, u)}\right|$$
for all $x$ and $u$. Now if we choose an acceptance probability
$$\alpha(x, x') = \min\left(1,\ \frac{\pi(x')\, g'(u')}{\pi(x)\, g(u)} \left|\frac{\partial (x', u')}{\partial (x, u)}\right|\right),$$
then, substituting this back in, we have
$$\begin{aligned}
\pi(x)\, g(u)\, \alpha(x, x') &= \min\left(\pi(x)\, g(u),\ \pi(x')\, g'(u') \left|\frac{\partial (x', u')}{\partial (x, u)}\right|\right) \\
&= \pi(x')\, g'(u') \left|\frac{\partial (x', u')}{\partial (x, u)}\right| \min\left(\frac{\pi(x)\, g(u)}{\pi(x')\, g'(u')} \left|\frac{\partial (x, u)}{\partial (x', u')}\right|,\ 1\right) \\
&= \pi(x')\, g'(u')\, \alpha(x', x) \left|\frac{\partial (x', u')}{\partial (x, u)}\right|,
\end{aligned}$$
where the third line uses the property that the Jacobian matrix of $h'$, the inverse transformation, is the inverse of the Jacobian of $h$ (so the min in the second line is exactly $\alpha(x', x)$). This proves that the choice of $\alpha$ given above yields detailed balance.
To connect this approach to traditional MH notation, which uses a proposal density $q(x' \mid x)$, we note that we can construct such a density just by integrating over $u$:
$$q(x' \mid x) = \int g(u)\, \delta\big(x' - h_{\text{state}}(x, u)\big)\, du,$$
where $h_{\text{state}}$ denotes the state component of $h$.
Multiple move types
If we have multiple types of moves, we can allow the move taken to depend on the current state $x$. Formally, let $\mathcal{M}$ be a countable set of move types indexed by $m$, where we let $\rho_m(x)$ indicate the probability of applying move type $m$ in state $x$, and each move is formally defined by some distribution $g_m$ and proposal function $h_m$. Now the probability of moving from region $A$ to region $B$ of the state space is given by
$$\sum_m \int_{(x, x') \in A \times B} \pi(x)\, \rho_m(x)\, g_m(u)\, \alpha_m(x, x')\, dx\, du,$$
and thus the detailed balance requirement is
$$\sum_m \int_{(x, x') \in A \times B} \pi(x)\, \rho_m(x)\, g_m(u)\, \alpha_m(x, x')\, dx\, du = \sum_m \int_{(x, x') \in A \times B} \pi(x')\, \rho_m(x')\, g'_m(u')\, \alpha_m(x', x)\, dx'\, du'.$$
It's clearly sufficient to show that detailed balance holds for each individual move type, which we can do by choosing the appropriate acceptance probability, similarly to above but now incorporating the move type probability:
$$\alpha_m(x, x') = \min\left(1,\ \frac{\pi(x')\, \rho_m(x')\, g'_m(u')}{\pi(x)\, \rho_m(x)\, g_m(u)} \left|\frac{\partial (x', u')}{\partial (x, u)}\right|\right).$$
Note that here the Jacobian term is with regard to $h_m$ and thus is implicitly specific to the move type in question. Also note that the move probabilities do not necessarily have to sum to 1. If they don't, then the remaining probability mass corresponds to no move being attempted.
Note that this approach is equivalent to taking the move type as an auxiliary variable, as discussed above in the "Random Choice of Proposal" section (with the caveat that in the current exposition we're assuming that the state consists only of continuous variables). To see this, note that is defined to mean , so we have
and
so both our proposals and acceptance probabilities can be written in terms of the joint state . Alternately, we can think of this as an extension of MH: just as MH extends the Metropolis algorithm, allowing asymmetric proposals by explicitly incorporating the proposal probabilities into the acceptance rate, we can extend MH to allow asymmetric move probabilities, of the form , using the same trick of just incorporating the probabilities into the acceptance rate in order to cancel out the asymmetry.
Discrete variables
If our state includes discrete variables, we can represent the transitions between these values as different moves. For a general discrete state space, we let $m_v$ denote the move which sets the discrete component to value $v$; then the move probabilities $\rho_{m_v}(x)$ play the role of a proposal distribution over the discrete component. If our space also has a continuous component, then we can treat that component independently of the discrete component by taking the Cartesian product of the continuous moves we want with the discrete moves just defined. But this isn't necessary in general; we can define a joint proposal on the continuous and discrete components by defining a different set of continuous moves for each discrete move.
Changing dimension
add stuff about the $n + r = n' + r'$ requirement in order to be a diffeomorphism.
Open-world Inference
A special case of interest is that of a world containing an unknown number of objects, each described by some $d$-dimensional parameter vector, so that a world with $k$ objects is described by $kd$ parameters. Here our state is given by a vector $x = (x_1, \ldots, x_k)$ describing a world of $k$ objects, and we want to construct a Markov chain that samples from a distribution $p(k, x_1, \ldots, x_k)$.
A simple example of this type is a Gaussian mixture model in which we have an unknown number of mixture components, each parameterized by its mean location (we assume for simplicity that all components have the same weight and the same covariance matrix).
In an open-world scenario, we can imagine a few specific types of moves:
- Birth/Death. A birth move creates a new object and initializes its parameters according to some random proposal. In the simplest form, we might think of $g$ itself as being a proposal distribution for a new object, in which case the birth move looks like $h\big((x_1, \ldots, x_k), u\big) = (x_1, \ldots, x_k, u)$. (However, in general we want our new object proposals to be allowed to depend on the current objects. To do this, we go back to thinking of $u$ just as a source of random bits -- or more realistically, samples from well-known distributions like a standard uniform, standard Gaussian, etc. -- and move the hard work into the proposal function $h$.)
The inverse of a birth move is a death move, which deletes an object. Here we encounter a subtlety: the vector representation used above confers a notion of ordering on the objects in the world, i.e. it is formally different to write $(x_1, x_2)$ than to write $(x_2, x_1)$. The simple birth move above always adds the birthed object as the "rightmost" object, so its inverse would always kill the rightmost object. Although this is formally sufficient to maintain irreducibility -- we can always get from one state to another state by killing all the objects of the first state, then birthing all the objects of the other state -- we probably actually want to be able to kill arbitrary objects, not just the most recently created one. Thus our death move requires a choice of which object to kill.
warning: below this point is a bunch of rambling, which does resolve the issue in question, but needs a lot of cleanup to be presentable to anyone.
We can implement this formally using the notion of discrete states discussed above: for each position $i$, we define a birth/death move pair, in which the birth move inserts the new object at position $i$, and the death move kills the object at position $i$. In a state containing $k$ objects, we define the move probabilities such that all birth moves up to position $k + 1$ inclusive have equal probability.
okay, the confusing thing about the paper is that it doesn't say how to choose between applying forward and backward moves. it says a move takes us from (k, x) to (k', x'), and the backwards move does the opposite. one approach is to have formally different birth moves for going from 1 to 2 objects, 2 to 3 objects, etc. then part of our "k" is the number of objects currently in the world. so at any given point, most of our moves aren't applicable: only the ones which result in objects can be inverted, and only the ones which start with k objects can be pushed forward.
this is essentially using the discrete side to do the bookkeeping. what about moves that don't change k? in a fixed-dimensional space…
sometimes we have a proposal whose inverse is the same as itself (e.g. a symmetric Gaussian move), so it doesn't matter how we choose. but now suppose we have a "move-right" proposal which samples from a positive Gaussian, and its "move-left" inverse. if we sample this move type, what happens? do we move left or right? one answer which I think is formally consistent, is to define two versions of the move, one for the move-right (whose inverse is move left) and the other for the move-left (whose inverse is the move-right). Then the convention can be that if both the forward and inverse moves are possible, we just always use the forward move. This lets us set whatever probabilities we want inside of . This is kind of an ugly hack, but I don't see another good way to do it… Actually, one easy fix is just to let range over move directions as well as move types. This is equivalent to just adding another term that assigns probabilities to each move direction. In most cases only one move direction will have any probability, so this term essentially just disappears.
This really is intuitively weird though. The formalism explicitly includes the reverse move, so it doesn't make sense to just ignore it and say we always do forward moves (but there just happens to be another forward move identical to the current reverse move).
Of course, we can always just use standard MH moves when the discrete state is not changing. The whole reason to separate the forward and reverse moves is that they act on different domains. If they have the same domain, you can wrap them both up in a single proposal density $q(x' \mid x)$, since this can just as easily be flipped to interrogate $q(x \mid x')$.
One thing to keep in mind: at the bottom of all this is the construction of a Markov chain. So if I ever get really confused, just think of the chain we're constructing.
end rambling part 1
It's important to think about the interaction between the birth/death moves and other within-dimension moves. For example, if all birth moves set a particular parameter to exactly zero, then it will be impossible to kill any object for which that parameter is nonzero. A general principle is that the objects resulting from a birth move should be supported on the entire space of objects that might conceivably be arrived at through within-dimension moves, since otherwise it is possible to arrive at unkillable objects.
Another (hopefully obvious) point in constructing valid birth moves is that our proposals have to be probabilistic: we can't just propose something arbitrarily, but for whatever we propose we must be able to calculate the probability of having proposed it.
- **Split/Merge.**
Auxiliary Variable methods
Note: this is following Storvik's "On the Flexibility of Metropolis–Hastings Acceptance Probabilities in Auxiliary Variable Proposal Generation".
We want to make an initial proposal, then run a few MH steps to fine-tune that proposal into our final proposal $x^*$. Unfortunately it's not obvious how to directly compute the density $q(x^* \mid x)$ of that overall procedure. As an alternative, we can think of the intermediate states reached by the MH steps as auxiliary variables, so our proposal is really a joint proposal over the intermediate states and $x^*$, where each intermediate state is generated from the previous one by an MH step.
Since we're including the intermediate states as auxiliary variables, we need a joint target distribution over them and $x$. We define this as the product of $\pi(x)$ and some conditional distribution for the auxiliary variables given $x$, which we get to choose in order to maximize the acceptance rate. Note that this joint target has marginal distribution $\pi(x)$, so the $x$'s we sample will in fact be correctly distributed. For the reverse move, we must use the same target distribution, though we're free to choose a different proposal distribution if we like. This gives acceptance ratio
Note that, in this approach, the old values of the auxiliary variables also appear in the acceptance ratio, so in principle we should store them. In practice, if it is possible to sample from , then we can lazily generate on demand, since the move that replaces a stored with a new is a Gibbs move that always accepts. So a move in this framework -- the "proposition 3 framework" -- looks like: first propose an initial , then run MH for a while to generate , then propose , then sample , and use these sampled values in the acceptance ratio above.
There is a slightly different approach, the "proposition 4 framework", which is a bit more flexible. Here we think of the joint state as , i.e., we pretend to have already generated the 's we'll use for our next proposal, so the target density is where includes MH steps as defined above. Of course we do not need to explicitly store since we can Gibbs-generate it on demand by sampling from .
In this setup, our joint proposal is , i.e., we sample given our (presampled) , and then sample new from some distribution . Note the difference in role of : here it is (formally speaking) part of the proposal, rather than the stationary distribution as above. The main practical difference is that now can depend, not just on , but also and . For example, we could force by taking .
Choices of
Storvik considers functions of the form
in which we assume that the later steps are MH moves for the target, though the earlier moves may be arbitrary. This is equivalent to a hypothesis that the first steps of the sequence are generated by our forward proposal distribution, while the final steps are generated by running MH from the destination. (There is some subtlety here, which is that our first value is usually dependent on the previous state. In the prop 3 setup, we are allowed to let the chosen auxiliary-variable distribution depend on the previous state, so we are okay if the relevant information from the previous state is still preserved; otherwise we may need to choose some other distribution, which will then not cancel out of the acceptance ratio. In the prop 4 setup, it seems like this shouldn't be an issue, but I am therefore a little confused about why Storvik bothers mentioning the issue with regard to prop 3.) The special case in which every step is treated as an MH move corresponds to assuming the entire sequence is generated by MH from our destination, meaning that we can ignore the destination and just compute the weight function
where the "weight function" is simply the component of the acceptance ratio coming from the forward move, with an analogous function for the backwards move; see the paper for more. This case says that we can essentially ignore the fact that we've run additional MH samples: we accept the final state of the MH sequence if and only if we would have accepted its start state .
Another special case is , which gives weight
Note that this requires us to be able to compute the density , but does not depend at all on the intermediate states ; this is because for those variables we have defined so (under either the prop 3 or prop 4 setups) they are being sampled from their stationary distribution.
More generally, we can use linear combinations of weight functions. TODO: I don't really understand why and the paper doesn't spell it out (except in the importance sampling section). This lets us use information from all of the MH samples, though the results in the paper don't seem to get much of a benefit from this. one answer: we get to choose , so we can define it as a mixture distribution, which I think then just spills out into a mixture of weight functions if everything else is held constant.
Mismatched Proposals
Suppose we have two proposal distributions $q_1$ and $q_2$. You could think of these as "smart" and "dumb" reversible jump proposals, but in the simplest case they could even just be random-walk proposals from Gaussians with different standard deviations, e.g., $\sigma_1$ and $\sigma_2$ respectively.
It's clearly valid to propose from the mixture proposal $\frac{1}{2} q_1 + \frac{1}{2} q_2$ as long as we also compute the acceptance probability under the mixture. And of course it's valid to apply MH to each proposal individually, and then operate under the mixture kernel that first chooses a proposal distribution and then computes the acceptance ratio specific to that proposal.
But it's also valid to do something a bit trickier. Consider the "mismatched" acceptance ratios
$$\alpha_{12}(x, x') = \min\left(1, \frac{p(x')\, q_2(x \mid x')}{p(x)\, q_1(x' \mid x)}\right), \qquad \alpha_{21}(x, x') = \min\left(1, \frac{p(x')\, q_1(x \mid x')}{p(x)\, q_2(x' \mid x)}\right),$$
in which we propose from one distribution but compute the reverse probability under the other distribution. We can use these to construct MH transition kernels
$$T_{12}(x' \mid x) = q_1(x' \mid x)\,\alpha_{12}(x, x') + \mathbb{1}(x' = x)\, r_{12}(x), \qquad T_{21}(x' \mid x) = q_2(x' \mid x)\,\alpha_{21}(x, x') + \mathbb{1}(x' = x)\, r_{21}(x),$$
where $r_{12}$ and $r_{21}$ are rejection probabilities defined analogously to the Metropolis-Hastings section above. These transition kernels will not individually have detailed balance, but the mixture kernel,
$$T(x' \mid x) = \tfrac{1}{2}\, T_{12}(x' \mid x) + \tfrac{1}{2}\, T_{21}(x' \mid x),$$
in which we choose at random which distribution to propose from, but then compute a "mismatched" acceptance probability, does in fact have detailed balance.
To show this, we need to prove
$$p(x)\, T(x' \mid x) = p(x')\, T(x \mid x'),$$
or equivalently
$$p(x)\big[q_1(x' \mid x)\,\alpha_{12}(x, x') + q_2(x' \mid x)\,\alpha_{21}(x, x')\big] = p(x')\big[q_1(x \mid x')\,\alpha_{12}(x', x) + q_2(x \mid x')\,\alpha_{21}(x', x)\big],$$
for all pairs $x, x'$. This is trivially true when $x = x'$, so we need only consider the case $x \ne x'$. Let
$$R_{12} = \frac{p(x')\, q_2(x \mid x')}{p(x)\, q_1(x' \mid x)}, \qquad R_{21} = \frac{p(x')\, q_1(x \mid x')}{p(x)\, q_2(x' \mid x)}$$
denote the inner parts of the acceptance ratios $\alpha_{12}(x, x')$ and $\alpha_{21}(x, x')$ respectively (the corresponding inner parts of $\alpha_{12}(x', x)$ and $\alpha_{21}(x', x)$ are $1/R_{21}$ and $1/R_{12}$). Then assuming $x \ne x'$ we have
$$p(x)\, T(x' \mid x) = \tfrac{1}{2}\, p(x)\big[q_1(x' \mid x)\min(1, R_{12}) + q_2(x' \mid x)\min(1, R_{21})\big].$$
This expression is annoying due to all the mins, so we break it into four cases according to whether each of $R_{12}$ and $R_{21}$ exceeds 1. The last two cases follow from the first two by the symmetries of swapping $q_1$ with $q_2$ and swapping $x$ with $x'$, so we need consider only the first two. In the first case, $R_{12} \le 1$ and $R_{21} \le 1$, we have
$$p(x)\, T(x' \mid x) = \tfrac{1}{2}\big[p(x)\, q_1(x' \mid x)\, R_{12} + p(x)\, q_2(x' \mid x)\, R_{21}\big] = \tfrac{1}{2}\big[p(x')\, q_2(x \mid x') + p(x')\, q_1(x \mid x')\big] = p(x')\, T(x \mid x'),$$
since in this case $\alpha_{12}(x', x) = \alpha_{21}(x', x) = 1$, thus establishing detailed balance in this case. In the second case, $R_{12} \le 1$ and $R_{21} > 1$, we have
$$p(x)\, T(x' \mid x) = \tfrac{1}{2}\big[p(x')\, q_2(x \mid x') + p(x)\, q_2(x' \mid x)\big]$$
while
$$p(x')\, T(x \mid x') = \tfrac{1}{2}\Big[p(x')\, q_1(x \mid x')\cdot \tfrac{1}{R_{21}} + p(x')\, q_2(x \mid x')\Big] = \tfrac{1}{2}\big[p(x)\, q_2(x' \mid x) + p(x')\, q_2(x \mid x')\big],$$
and together with the previous case (and the symmetric cases) this establishes detailed balance for our mixture kernel.
This argument establishes the validity of a wide range of reversible-jump-flavored proposals without the need to split hairs regarding the domain of individual moves ("this distribution is a birth on states with an odd number of objects but a death on states with an even number") or to rely on a degenerate mixture proposal (propose from a mixture of the birth and death moves, where for any pair of states one or the other will have zero probability since the dimensions won't match). It means we can feel free to smash together any two moves $q_1$ and $q_2$ as the "reversals" of each other, without having to contort ourselves into constructing any single proposal that actually reverses itself.
Note that for reversible jump moves it will often be the case that one of $q_1(x' \mid x)$ and $q_2(x' \mid x)$ is zero, for example, if $q_1$ is a birth move and $q_2$ is a death move. It is straightforward to check that the derivation goes through in this case.
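A minimal sketch of the mismatched-proposal mixture kernel (my own illustration): at each step flip a fair coin to propose from either a narrow or a wide Gaussian random walk, and score the reverse move under the *other* proposal's density. The standard normal target and the two scales are assumptions for the example; the printed mean and standard deviation should come out near 0 and 1, consistent with the detailed-balance argument above.

```python
import numpy as np

rng = np.random.default_rng(6)
sigmas = (0.5, 5.0)                       # the two (made-up) proposal scales

def log_p(x):
    return -0.5 * x**2                    # unnormalized log target: standard normal

def log_q(a, b, s):
    return -0.5 * ((a - b) / s)**2 - np.log(s)   # log N(a; b, s^2), up to a shared constant

x, samples = 0.0, []
for _ in range(50000):
    i = rng.integers(2)                   # pick which proposal to draw from...
    j = 1 - i                             # ...but score the reverse move under the other one
    x_prop = x + sigmas[i] * rng.normal()
    log_alpha = (log_p(x_prop) + log_q(x, x_prop, sigmas[j])
                 - log_p(x) - log_q(x_prop, x, sigmas[i]))
    if np.log(rng.uniform()) < log_alpha:
        x = x_prop
    samples.append(x)
print("mean %.3f, std %.3f" % (np.mean(samples), np.std(samples)))   # ~ 0, ~ 1
```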
TODO: It feels like there should be a more "intuitive" explanation of this fact that doesn't rely on multiple cases of algebraic manipulations of transition probabilities.