mcmc notes: Nonlinear Function
Created: September 14, 2014
Modified: March 16, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Note: these are personal notes, taken as I was refreshing myself on this material. They're mostly stream of consciousness and probably not very rigorous or useful to anyone else. The main sources are Wikipedia and Andrieu et al, "An Introduction to MCMC for Machine Learning".

In AI, we often represent knowledge about the world in the form of a graphical model, in which we represent the joint distribution over a set of random variables as a product of simpler, local terms involving only a subset of the variables. In a directed graphical model, these are the conditional distributions of each variable given its parents; in an undirected model they are factors associated with cliques of the graph (or more generally, associated with any subsets of the variables). We are interested in inference, which generally means obtaining $P(Q|E)$ where $E$ is a subset of variables we have observed as evidence, and $Q$ is a different subset constituting a query for which we would like the posterior distribution of values given the observed evidence (sometimes we also want the most likely value, but we'll not focus on that right now).

MCMC finds approximate posterior distributions through sampling. We construct a Markov chain whose stationary distribution is the distribution of interest, and then run the chain to produce samples. The samples we get are not independent, since the chain will not have time to mix between successive samples, but given sufficiently many samples we can get a good estimate of the distribution.

Markov Chains

The state space is the set of all possible joint assignments of values to the random variables in the model, each of which we call a state. A Markov chain is defined by a function $T_t(x, x')$ giving the probability of transitioning from state $x$ to state $x'$ at time $t$ (in the special case of a time-homogeneous chain we omit the index $t$). The function $T_t$ is called a transition operator or transition kernel. If the state space is finite, this transition operator can be expressed as a left stochastic matrix $T_t$. Remember that a left stochastic matrix is a matrix in which all the entries are nonnegative and the columns sum to 1. Thus if we have a probability distribution over states expressed in a vector $\pi$, where $\sum_i \pi_i = 1$, then $T_t\pi$ gives the distribution after one step of the Markov chain.
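
As a quick sanity check (a minimal numpy sketch with a made-up 3-state chain), we can evolve a distribution under a left stochastic matrix and compare against the eigenvalue-1 eigenvector:

```python
import numpy as np

# Columns sum to 1: T[j, i] is the probability of moving from state i to state j.
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])
assert np.allclose(T.sum(axis=0), 1.0)

pi = np.array([1.0, 0.0, 0.0])   # start with all mass on state 0
for _ in range(1000):
    pi = T @ pi                  # one step of the chain

# The stationary distribution is the eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T)
stat = np.real(vecs[:, np.argmax(np.real(vals))])
stat /= stat.sum()
print(pi, stat)                  # these should agree
```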

There are some properties which a Markov chain (or its states) can have:

  • (Ir)reducibility. We say that a state $x'$ is accessible from $x$ if there is some number of steps $n$ such that we have nonzero probability of being in state $x'$ after taking $n$ steps from $x$. A chain is irreducible if every state is accessible from every other state. If we drew the state space as a graph, with edges labeled by transition probabilities and omitted for transitions with zero probability, then the graph of an irreducible chain has a single connected component. (Note however that there are also graphs with a single connected component corresponding to reducible chains, since accessibility is a directed relationship. We say that two states communicate if they are mutually accessible to each other; then a chain is irreducible iff all states communicate.)
  • Periodicity. A state is aperiodic if there is some $n$ such that, for all $n' > n$, there is a path of length $n'$ having nonzero probability that returns to that same state. By contrast, a periodic state is one which can only be returned to at times which are integer multiples of some period $k > 1$. For example, the states in a bipartite graph are periodic with period 2. A chain is aperiodic iff all of its states are aperiodic. Note that if one state in an irreducible chain is aperiodic, then all of them are, since we can always return to that one state and spend an arbitrary amount of time dicking around.
  • Recurrence. A state is transient if, starting in that state, there is nonzero probability that we will never return to that state. A state is recurrent if it is not transient. The mean recurrence time is the expected time to return to the state. Note that a recurrent state does not necessarily have a finite mean recurrence time; for example, if the probability of first returning after $n$ steps is proportional to $1/n^2$, then the mean recurrence time is given (up to the normalizing constant) by $\sum_{n} n \cdot \frac{1}{n^2}$, which diverges. A state is positive recurrent if its mean recurrence time is finite. Positive recurrence is important because it implies that the state will be visited infinitely often, with a nonzero long-run fraction of time spent there.
  • Ergodicity. A state is ergodic if it is aperiodic and positive recurrent. If all states in an irreducible Markov chain are ergodic, then we say that the chain itself is ergodic.

A distribution over states $\pi$ is called a stationary distribution if it is a fixed point of the transition operator $T$. An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent; in this case the stationary distribution probabilities are just the reciprocals of the mean recurrence times, normalized to sum to 1 (and thus the stationary distribution is unique). Note that even periodic chains can have stationary distributions, e.g., an "alternating" chain with two states that each transition to the other with probability 1 has a stationary distribution $\pi = [0.5, 0.5]$. However, only an aperiodic chain is guaranteed to converge to its stationary distribution, i.e., have $\lim_{n\to\infty} p^{(n)}_{ij} = \pi_j$ for all $i$ (where $p^{(n)}_{ij}$ is the $n$-step transition probability).

If the chain is not irreducible, it may have several stationary distributions: one for each closed communicating class (a set of states which communicate with each other and cannot be escaped), and then also all convex combinations of those.

Even a time-inhomogeneous chain can have a stationary distribution. This is the case if we have multiple transition operators that each respect the same shared stationary distribution. Note that given transition operators $T_1$ and $T_2$ each having stationary distribution $\pi$, we can define the mixture kernel $\alpha T_1 + (1-\alpha) T_2$ (corresponding to applying $T_1$ at each step with probability $\alpha$ and $T_2$ otherwise) and the cycle kernel $T_1 T_2$ (corresponding to alternating between the two operators in sequence), and both of these will also have stationary distribution $\pi$.

A further property is reversibility or detailed balance. This is the condition that there exists some distribution over states $\pi$ such that, for all $i,j$,

\pi_i P(x_{t+1} = j \mid x_t = i) = \pi_j P(x_{t+1} = i \mid x_t = j).

It is easy to show (by summing both sides over $i$) that any such distribution must be a stationary distribution. Thus, if we can construct a chain to be in detailed balance with respect to some distribution $\pi$, then the chain must have $\pi$ as a stationary distribution. This is usually how MCMC constructions work.
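
A tiny numeric check of this implication (a sketch: build a matrix in detailed balance with a random $\pi$ via a Metropolis-style construction, then verify stationarity):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = rng.random(4)
pi /= pi.sum()

# Build a chain in detailed balance with pi: uniform proposal over all
# states plus a Metropolis-style acceptance probability.
T = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        if i != j:
            T[j, i] = 0.25 * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[:, i].sum()

flow = pi[None, :] * T                 # flow[j, i] = pi_i * T(j | i)
assert np.allclose(flow, flow.T)       # detailed balance ...
assert np.allclose(T @ pi, pi)         # ... which implies stationarity
```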

Summary. A stationary distribution exists as long as the chain is positive recurrent. If it is irreducible, the stationary distribution is unique. If it is aperiodic, the chain will mix to the stationary distribution. Ergodicity summarizes all of these to give a condition for "nice" Markov chains. We can also show the existence of a stationary distribution through detailed balance; in this case, ergodicity guarantees that the stationary distribution is unique.

Metropolis Hastings

The Metropolis-Hastings (MH) algorithm is the basis of most MCMC algorithms used in practice. Here we introduce a proposal distribution $q(x' | x)$, from which we sample a proposal for a new state given the current state, and our transition kernel chooses to accept or reject the proposed new state according to its relative probability compared to the current state. Specifically, we compute an acceptance probability given by

A(x', x) = \min \left(1, \frac{\pi(x') q(x | x')}{\pi(x) q(x' | x)} \right)

and then accept the proposed state $x'$ with probability $A(x', x)$; otherwise it is rejected. If the proposal is rejected, we remain in the current state $x$ until the next time step. Thus the MH transition kernel is given by

T_{MH}(x', x) = q(x' | x) A(x', x) + \mathbb{I}(x = x') r(x)

where $\mathbb{I}(x = x')$ is an indicator function returning 1 if $x=x'$ and 0 otherwise, and $r(x)$, the probability of a rejected proposal, is given by

r(x) = \int (1 - A(x^*, x))\, q(x^* | x)\, dx^*.

It is easy to show that MH is in detailed balance with respect to $\pi$. Note that when $x = x'$ we have detailed balance as a tautology, so we need only consider the case where $x \ne x'$. Here we find

\begin{align*} \pi (x) T_{MH} (x', x) &= \pi(x) q(x' | x) \min \left(1, \frac{\pi(x') q(x | x')}{\pi(x) q(x' | x)} \right) \\ &= \min \left(\pi(x) q(x' | x), \pi(x') q(x | x') \right) \\ &= \pi(x') q(x | x') \min \left(1, \frac{\pi(x) q(x' | x)}{\pi(x') q(x | x')} \right)\\ &= \pi(x') T_{MH} (x, x') \end{align*}

so MH is in detailed balance with respect to $\pi$, and thus $\pi$ is a stationary distribution of the MH chain. To show that this stationary distribution is unique, we need to establish irreducibility; here it is sufficient that $q(\cdot | x)$ is supported across the entire state space (technically we only need that it is supported wherever $\pi$ is supported, since this establishes a communicating class covering the full support of the stationary distribution). To show that we converge to the stationary distribution, we need the chain to be aperiodic; this is immediately established by the possibility of rejection at each step.

The independent sampler is the special case of MH in which the proposal distribution $q(x')$ does not depend on the current state $x$. This is kind of like rejection sampling, but still produces correlated samples since the acceptance probability of the next state is still determined by a comparison to the current state. The Metropolis sampler is the special case in which the proposal distribution is symmetric, i.e., $q(x' | x) = q(x | x')$. Here the acceptance probability reduces to $A(x', x) = \min\left(1, \frac{\pi(x')}{\pi(x)}\right)$. This can be nice since it eliminates the need to compute the proposal density $q(x' | x)$.
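
Here's a minimal random-walk Metropolis sketch (the target, a two-component Gaussian mixture, is just a made-up example):

```python
import numpy as np

def log_target(x):
    # Unnormalized log density: mixture of two unit-variance Gaussians.
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis(n_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        x_prop = x + step * rng.normal()    # symmetric proposal, so q cancels
        if np.log(rng.random()) < log_target(x_prop) - log_target(x):
            x = x_prop                      # accept
        samples[t] = x                      # on rejection we repeat x
    return samples

samples = metropolis(50_000)
print(samples.mean(), samples.std())        # roughly 0 and a bit over 2
```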

Derivation and Optimality

(from Chib, Greenberg, "Understanding the Metropolis-Hastings Algorithm" (1995))

We can actually construct the MH sampler as the optimal acceptance probability to maintain detailed balance for an arbitrary proposal distribution. Recall that detailed balance requires

\pi(x)T(x'|x) = \pi(x')T(x|x'), \; \forall (x,x')

and suppose now that we have some proposal distribution $q(x'|x)$. If this proposal satisfies detailed balance, then we are done, but in general there will be some $x \ne x'$ such that

\pi(x)q(x'|x) > \pi(x')q(x|x'),

without loss of generality since we can always swap $x, x'$. This inequality means that we move from $x$ to $x'$ "too often" relative to the reverse move. We'd like to reject some of those moves to preserve the balance (there could be strategies to fix this problem other than rejection, but rejection is an obvious thing to consider), so we introduce an acceptance probability $\alpha(x', x)$, which we require to satisfy detailed balance, i.e.,

\pi(x)q(x'|x)\alpha(x', x) = \pi(x')q(x|x')\alpha(x, x').

Solving this for the $\alpha$'s, we get the condition

\frac{\alpha(x', x)}{\alpha(x, x')} = \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}.

This is a condition on the ratio of the $\alpha$'s, meaning that we retain the freedom to scale them both by any constant. The obvious thing to do is choose the largest possible scaling constant, so that we maximize the $\alpha$'s, i.e., maximize the acceptance rate of the chain. Since the $\alpha$'s are probabilities, the largest possible value is 1, so we take $\alpha(x, x')=1$ (by our inequality above, we are assuming without loss of generality that this is the larger of the two $\alpha$'s) and therefore

\alpha(x', x) = \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)},

where the ratio on the right is guaranteed to be less than 1, again by our assumption above. In general, this yields the MH rule

\alpha(x', x) = \min\left(1, \frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}\right)

in which the $\min$ operator automatically chooses the appropriate direction in which to restrict the acceptance probability. And as we showed, this is optimal within the space of acceptance rules achieving detailed balance, in the sense that it will accept all moves in the underrepresented direction, and then only reject in the other direction with the probability required to satisfy detailed balance.

Random Choice of Proposal

(most of this taken from Matthias Seeger's Notes on Metropolis Hastings)

Suppose we have several proposal distributions $q(x' | x, \beta)$ indexed by $\beta$. These may help us explore the space in different ways. Since the stationary distribution of the MH kernel is determined by construction, and doesn't depend on the proposal distribution, we can define an MH kernel for each proposal distribution, and these kernels will all share the same stationary distribution $\pi$. Thus we can use them in sequence, or we can randomly sample a different proposal to use at each step, and the corresponding cyclic or mixture kernel will also have stationary distribution $\pi$. What if not all of the proposals are supported across the entire state space (e.g. if each one only has the potential to modify a subset of the variables)? I believe (without having worked out a formal proof) that everything works out as long as the composition of the proposals (in the cyclic case) or a probabilistic mixture of the proposals (in the mixture case) is supported across the entire space.

The more interesting case comes when we want to choose which proposal to use in a way that depends on the current state $x$. Here we have a distribution $p(\beta|x)$, so we could consider the marginal proposal

q^*(x' | x) = \int q(x' | x, \beta)\, p(\beta|x)\, d\beta.

But this may not be easy to compute, so we'd like to be able to just sample $\beta$ from $p(\beta|x)$ at each step and then use the resulting proposal distribution $q(x'|x,\beta)$. This process gives the same sampling distribution for $x'$ as the marginal proposal above, but the acceptance ratio is different since we're only evaluating the proposal density for a point estimate of $\beta$. Unfortunately, this does not necessarily give us detailed balance and so will not necessarily maintain the stationary distribution. In proving detailed balance for standard MH, the relevant part of the transition kernel was $q(x'|x)A(x', x)$ and the key observation was that the expression $\min\left(\pi(x) q(x'|x),\, \pi(x')q(x|x') \right)$ is symmetric with respect to $x, x'$. Now, the relevant part of our transition kernel is $\int q_\beta(x'|x)A_\beta(x', x)\,p(\beta|x)\, d\beta$, where we are marginalizing the kernel over $\beta$. Note that this is different from marginalizing the proposal over $\beta$, since now we are accepting or rejecting based on the probabilities from the specific proposal $q_\beta(x'|x)$ we sampled rather than using the marginal proposal. Now detailed balance will require

\pi(x) \int q_\beta(x'|x)A_\beta(x', x)\,p(\beta|x)\, d\beta = \pi(x')\int q_\beta(x|x')A_\beta(x, x')\,p(\beta|x')\, d\beta

which boils down to showing that the expression (appearing inside the integral)

\min\left(\pi(x)q_\beta(x'|x)p(\beta|x),\; \pi(x')q_\beta(x | x')p(\beta|x)\right)

is symmetric with respect to $x, x'$. The obvious difficulty is the $p(\beta|x)$ factor; we'd really like this expression to be

\min\left(\pi(x)q_\beta(x'|x)p(\beta|x),\; \pi(x')q_\beta(x | x')p(\beta|x')\right)

since this would in fact be obviously symmetric. Unfortunately, that wasn't how things worked out, but it does mean we can guarantee detailed balance in the special case where these expressions are equal, i.e. when

\pi(x')q_\beta(x | x')p(\beta|x) = \pi(x')q_\beta(x | x')p(\beta|x').

When is this condition satisfied? Well, trivially it holds if $p(\beta|x)$ does not in fact depend on $x$, as in the simple case described above. More generally, it holds if we can think of our state as being factored into several variables, where we are proposing to update a particular variable or variables but choosing $\beta$ using only the values of other variables that are left unchanged by the proposal.

What to do if the condition is not satisfied? If we can't compute the marginal proposal distribution, another possibility is to incorporate $\beta$ into the state space directly as an auxiliary variable.

We can also do the RJMCMC thing, which is equivalent to just implicitly including an auxiliary variable (see below). That is, we explicitly symmetrize by including the choice of proposal in the acceptance ratio, by solving for the necessary condition

\frac{A_\beta(x', x)}{A_\beta(x, x')} = \frac{\pi(x') q_\beta(x|x')p(\beta|x')}{\pi(x) q_\beta(x'|x)p(\beta|x)}

and optimizing as above to get

A_\beta(x', x) = \min\left(1, \frac{\pi(x') q_\beta(x|x')p(\beta|x')}{\pi(x) q_\beta(x'|x)p(\beta|x)} \right).

That is, we're free to choose kernels intelligently based on the current state, but we have to account for the probability of making the opposite choice. In the RJMCMC case, this means we're free to choose an appropriate birth proposal based on the current state, but in order to accept we'll have to be willing to choose the corresponding death proposal from the new state we end up in.

To see this as an auxiliary variable method, we treat $\beta$ as an auxiliary variable with a uniform distribution (at least if the domain of $\beta$ is finite). Then it will always cancel out of $\pi(x)$, and we don't actually need to keep track of it between steps. Now the proposal is really $q(x', \beta|x, \beta_\text{prev})$ over the joint state, but we ignore the "previous value" $\beta_\text{prev}$ and just propose from $q(x', \beta|x)$ by the chain rule: $q(x', \beta|x) = q(x'|\beta, x)q(\beta|x)$, which is just the factor $q_\beta(x'|x)p(\beta|x)$ under different notation, thus recovering the RJMCMC acceptance ratio.

WAIT WHAT? In the auxiliary variable construction we'd naively get $q(x', \beta|x)$ in the denominator, but $q(x, \beta_\text{prev}|x')$ in the numerator. Which is not the RJMCMC rule, since we can't identify the two $\beta$s. We don't track $\beta_\text{prev}$, but we can always materialize it from the target uniform distribution. But if we did this we would not in general get high acceptance probabilities. So I still believe the RJMCMC argument is valid, but I don't think the auxiliary model argument justifies it.

Simulated Annealing

If we only want the highest-probability state, we can run MH with a non-homogeneous kernel, constructed so that the invariant distribution at time $t$ is proportional to $\pi^{1/C_t}$, where $C_t$ is a cooling schedule such that $C_t \to 0$ as $t\to \infty$. Then in the limit, we have a stationary distribution $\pi^\infty$ in which all the mass is at the global maxima. This is called simulated annealing. It works well if the proposal distribution is good, so the chain mixes well, and if the cooling schedule is appropriate to the mixing rate of the chain (e.g., we don't want to cool all the way to zero before the chain has even had time to mix).
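
A minimal sketch of annealed Metropolis on a made-up 1-D energy (the geometric schedule and the constants are arbitrary choices, not recommendations):

```python
import numpy as np

def energy(x):
    # U(x) = -log pi(x) for a bumpy 1-D target; global minimum near x = 2.8.
    return 0.1 * (x - 3.0) ** 2 + np.sin(5.0 * x)

def anneal(n_steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal()
    for t in range(n_steps):
        C_t = max(1e-3, 0.999 ** t)          # geometric cooling schedule
        x_prop = x + rng.normal()
        # MH on the tempered target pi^{1/C_t}: energies get scaled by 1/C_t.
        if np.log(rng.random()) < (energy(x) - energy(x_prop)) / C_t:
            x = x_prop
    return x

print(anneal())   # should end up near a global maximum of pi
```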

Gibbs Sampling

A special case of MH is when we have separate proposals for each variable $x_i$, and we choose the proposal distribution to be the conditional distribution of that variable given all the other variables, i.e.

q(x_i' | x) = p(x_i' | x_{-i})

where we use $x_{-i}$ to denote all but the $i$th component of $x$. In a graphical model, this reduces to the conditional distribution of the variable given its Markov blanket: in a directed model, its parents, children, and mates (other parents of its children); in an undirected model, just its neighbors.

Given this proposal distribution, we have acceptance probability

\min\left(1, \frac{\pi(x') p(x_i | x'_{-i})}{\pi(x) p(x'_i | x_{-i})}\right)

We can expand out the stationary probabilities using the chain rule $\pi(x) = p(x_i | x_{-i}) p(x_{-i})$ to get

\min\left(1, \frac{p(x'_i | x'_{-i}) p(x'_{-i}) p(x_i | x'_{-i})}{p(x_i | x_{-i}) p(x_{-i}) p(x'_i | x_{-i})}\right)

Now the key is to realize that $x_{-i} = x'_{-i}$; i.e. the current and next state agree on everything except $i$. If we make this substitution, writing

\min\left(1, \frac{p(x'_i | x_{-i}) p(x_{-i}) p(x_i | x_{-i})}{p(x_i | x_{-i}) p(x_{-i}) p(x'_i | x_{-i})}\right),

then we see that every term cancels, so our acceptance probability is always 1! Thus a Gibbs move is always accepted.

Since we have lots of individual proposals, one for each variable, we can either randomly sample a different variable to update at each step, creating a mixture kernel, or more typically just choose an ordering in which to update the variables, creating a cyclic kernel. Either one is fine, as discussed above. Note that we don't have to update the variables with equal frequency. We do, however, have to be careful about selecting which variable to sample in a way that depends on the variables' current values: in general this does not preserve the stationary distribution of the chain (see the discussion of state-dependent proposal choice above).
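
A minimal Gibbs sketch for a case where the conditionals are known exactly, a bivariate Gaussian with correlation $\rho$ (the conditional formulas are standard Gaussian identities):

```python
import numpy as np

def gibbs_bivariate_gaussian(n_samples, rho=0.9, seed=0):
    # Target: zero-mean bivariate Gaussian, unit variances, correlation rho.
    # Both full conditionals are Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2).
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.normal()  # sample x1 | x2
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal()  # sample x2 | x1
        out[t] = (x1, x2)                                     # always accepted
    return out

s = gibbs_bivariate_gaussian(100_000)
print(np.corrcoef(s.T))   # empirical correlation should be close to rho
```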

Multiple Try Metropolis / Hit-And-Run

Multiple Try

(ref: Jun S Liu; Faming Liang; Wing Hung Wong, "The multiple-try method and local optimization in metropolis sampling", 2000)

Multiple-try allows us to make a bunch of proposals, then choose (approximately) the best one. Let $g(x, y) = \pi(x)q(y|x)$. Now starting in a state $x$, we draw $k$ proposals $y_1, \ldots, y_k \sim q(\cdot|x)$, and compute $g(y_i, x)$ for $i=1,\ldots,k$ (note these are the numerators of the acceptance ratios for the individual proposals). Now sample a $y$ from the $y_i$'s with probability proportional to $g(y_i, x)$. Given this $y$, let $x^*_1, \ldots, x^*_{k-1}$ be samples from $q(\cdot | y)$, and let $x^*_k = x$. Now accept the move to $y$ with probability

\min\left(1, \frac{\sum_{i=1}^k g(y_i, x)}{\sum_{i=1}^{k} g(x^*_i, y)}\right).

It turns out (see the ref cited, the proof is not hard) that this satisfies detailed balance.
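
A sketch of one multiple-try step with a symmetric Gaussian proposal. With a symmetric $q$ we're free to take the weights to be $w(y, x) = \pi(y)$ (this corresponds to choosing $\lambda(x,y) = 1/q(y|x)$ in the paper's generalized weighting, which is symmetric exactly when $q$ is):

```python
import numpy as np

def mtm_step(x, log_pi, k=5, step=1.0, rng=None):
    rng = rng or np.random.default_rng()
    ys = x + step * rng.normal(size=k)           # k forward proposals
    log_w = log_pi(ys)                           # weights w(y_i, x) = pi(y_i)
    probs = np.exp(log_w - np.logaddexp.reduce(log_w))
    probs /= probs.sum()
    y = ys[rng.choice(k, p=probs)]               # pick y with prob ~ w(y_i, x)
    xs = np.append(y + step * rng.normal(size=k - 1), x)   # x*_k = x
    log_num = np.logaddexp.reduce(log_w)         # log sum_i w(y_i, x)
    log_den = np.logaddexp.reduce(log_pi(xs))    # log sum_i w(x*_i, y)
    return y if np.log(rng.random()) < log_num - log_den else x

log_pi = lambda v: -0.5 * v ** 2   # vectorized standard normal target
x = 0.0
for _ in range(1000):
    x = mtm_step(x, log_pi)
```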

Hit-and-Run

Basic idea: select a direction to move in by sampling from the unit ball surrounding the current state. Then

  • If possible, sample from the conditional (Gibbs) distribution along that direction. This is just a Gibbs step in a reparameterized space and, therefore, automatically accepted.
  • If it's not obvious how to sample directly from that conditional distribution, we can use multiple-try Metropolis: sample a bunch of candidate distances from a uniform, Gaussian, or whatever proposal distribution, and then follow the multiple-try procedure to choose one of them and accept/reject the final proposal (I think this is sometimes called "random ray" Monte Carlo).

Generalized Hit-and-Run

Delayed Rejection

(ref: Antonietta Mira, "On Metropolis-Hastings algorithms with delayed rejection")

Suppose we have two proposals, $q_1$ and $q_2$. We perform a standard MH move with proposal $q_1$, yielding acceptance ratio

\alpha_1(x, y) = \min\left(1, \frac{\pi(y)q_1(x|y)}{\pi(x)q_1(y|x)}\right).

If this proposal is rejected, we can delay rejection and make a second proposal from $q_2$, where, notably, our second proposal can depend on the value we just rejected. To figure out how to do this, consider the probability of moving from $x$ to $z$,

\begin{align*} T(z|x) &= p(\text{propose } z \text{ from } q_1)\, p(q_1 \text{ proposal accepted}) + \int p(\text{propose } y \text{ from } q_1)\, p(q_1 \text{ proposal rejected})\, p(\text{propose } z \text{ from } q_2)\, p(q_2 \text{ proposal accepted})\, dy \\ &= q_1(z|x)\alpha_1(x, z) + \int q_1(y|x) \left[1-\alpha_1(x, y)\right] q_2(z | x, y)\, \alpha_2(x, y, z)\, dy. \end{align*}

We want to find $\alpha_2$ such that detailed balance holds, i.e., $\pi(x)T(z|x) = \pi(z)T(x|z)$. We know by construction of $\alpha_1$ that the first term satisfies detailed balance:

\pi(x) q_1(z|x)\alpha_1(x, z) = \pi(z) q_1(x|z)\alpha_1(z, x)

so it remains to impose detailed balance on the second term:

\pi(x)\int q_1(y|x) \left[1-\alpha_1(x, y)\right] q_2(z | x, y)\, \alpha_2(x, y, z)\, dy = \pi(z)\int q_1(y|z) \left[1-\alpha_1(z, y)\right] q_2(x | z, y)\, \alpha_2(z, y, x)\, dy.

It is sufficient (though perhaps not necessary) to require that the condition hold for all values of $y$, i.e.,

\pi(x) q_1(y|x) \left[1-\alpha_1(x, y)\right] q_2(z | x, y)\, \alpha_2(x, y, z) = \pi(z) q_1(y|z) \left[1-\alpha_1(z, y)\right] q_2(x | z, y)\, \alpha_2(z, y, x)

which yields the ratio condition

\frac{\alpha_2(x, y, z)}{\alpha_2(z, y, x)} = \frac{\pi(z) q_1(y|z) \left[1-\alpha_1(z, y)\right] q_2(x | z, y)}{\pi(x) q_1(y|x) \left[1-\alpha_1(x, y)\right] q_2(z | x, y)}.

By the same argument as the MH derivation above, the optimal choice is

\alpha_2(x, y, z) = \min\left(1, \frac{\pi(z) q_1(y|z) \left[1-\alpha_1(z, y)\right] q_2(x | z, y)}{\pi(x) q_1(y|x) \left[1-\alpha_1(x, y)\right] q_2(z | x, y)}\right).

Using this $\alpha_2$ as the acceptance ratio, we can try a "second proposal" $q_2$ if $q_1$ fails, at some additional computational cost since we have to evaluate the reverse first-stage quantities $q_1(y|z)$ and $\alpha_1(z, y)$. It's possible to do multiple rounds of delayed rejection, but the complexity rises quickly.
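
A sketch of one delayed-rejection step with a bold symmetric $q_1$ and a timid symmetric $q_2$ (both Gaussians, so the $q_2$ ratio cancels, but the reverse first-stage density $q_1(y|z)$ does not; the target and scales are made-up):

```python
import numpy as np

def log_pi(x):
    return -0.5 * x ** 2           # standard normal target (unnormalized)

def norm_pdf(v, loc, scale):
    return np.exp(-0.5 * ((v - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

def alpha1(x, y):
    # First-stage acceptance; q1 is symmetric so it cancels from the ratio.
    return min(1.0, np.exp(log_pi(y) - log_pi(x)))

def dr_step(x, s1=5.0, s2=0.5, rng=None):
    rng = rng or np.random.default_rng()
    y = x + s1 * rng.normal()      # bold first-stage proposal
    if rng.random() < alpha1(x, y):
        return y
    z = x + s2 * rng.normal()      # timid second-stage proposal
    # alpha_2: q2 is symmetric in (x, z) so it cancels, but the reverse
    # first-stage density q1(y|z) and rejection prob 1 - alpha1(z, y) do not.
    num = np.exp(log_pi(z)) * norm_pdf(y, z, s1) * (1 - alpha1(z, y))
    den = np.exp(log_pi(x)) * norm_pdf(y, x, s1) * (1 - alpha1(x, y))
    if den > 0 and rng.random() < min(1.0, num / den):
        return z
    return x
```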

Hamiltonian Monte Carlo (HMC)

Suppose that our target distribution $\pi(x)$ is differentiable and we know its (log) gradient. How can we use this in an MCMC algorithm?

Short answer: see Neal (2011), and http://mlg.eng.cam.ac.uk/tutorials/06/im.pdf. The next few sections are just cribbed from the Neal article.

Hamiltonian dynamics

The Hamiltonian $H(q,p)$ is a function of the position $q$ and momentum $p$ of all particles in the state (in general, if we have $k$ particles in a $d$-dimensional space, then $q$ and $p$ will each be $kd$-dimensional vectors). Given this function $H$, the dynamics of the system are defined by differential equations:

\frac{d q_i}{d t} = \frac{\partial H}{\partial p_i}, \qquad \frac{d p_i}{d t} = -\frac{\partial H}{\partial q_i}

In Hamiltonian Monte Carlo, we'll consider Hamiltonian functions of the form

H(q,p) = U(q) + K(p)

where $U(q)$, a function of the current position, is called the potential energy, and $K(p)$, a function of momentum, is the kinetic energy. Using HMC to sample from an (unnormalized) density $\pi$, we typically define the potential energy

U(q) = -\log \pi(q)

and kinetic energy

K(p) = p^T M^{-1} p / 2

where $M$ is a PSD "mass matrix", typically diagonal; this functional form corresponds to the unnormalized log-density of a zero-mean Gaussian with covariance $M$. In this setting, we have

\begin{align*} \frac{dq_i}{dt} &= [M^{-1}p]_i\\ \frac{dp_i}{dt} &= \frac{\partial \log \pi(q)}{\partial q_i}, \end{align*}

which says, intuitively, that the rate of change of position is given by the (scaled) momentum, and the rate of change of momentum is given by the slope of the log density, i.e., the rate at which we are "falling" downhill. Note that by integrating over these dynamics we obtain a mapping $T_s$ from the state at time $t$ to the state at time $t+s$.

Hamiltonian dynamics have several important properties. Reversibility means that given the state at time $t+s$, we can apply an inverse operator $T_{-s}$ to obtain the state at time $t$; this inverse is obtained by negating the time derivatives, or in our particular setting, by negating the momentum, applying $T_s$, then negating $p$ again. Conservation means that the Hamiltonian is invariant over time: $\frac{dH}{dt} = 0$. And volume preservation means that if we apply the operator $T_s$ to some region $R$ in $(q,p)$ space, the image of $R$ under $T_s$ will have the same volume as $R$; this can be shown by analyzing the determinant of the Jacobian of $T_s$. Volume preservation is apparently a consequence of a more general property, symplecticness, that I have no intuition for.

Discretization

Unfortunately Hamiltonian dynamics are defined implicitly as a differential equation; in general the forward operator $T_s$ will not have a nice analytic form. We can, however, simulate the dynamics on a computer using a discrete approximation. The obvious thing to try is Euler's method: given the state $(q,p)$ at time $t$, compute the state at time $t+\epsilon$ by taking gradient steps of size $\epsilon$.

A modified Euler's method, with better numerical properties, is to take a gradient step on $p$, then take a gradient step on $q$ (using the new value of $p$), instead of updating both variables simultaneously:

\begin{align*} p_i(t + \epsilon) &= p_i(t) - \epsilon \frac{\partial U}{\partial q_i} (q(t))\\ q_i(t + \epsilon) &= q_i(t) + \epsilon \frac{p_i (t+\epsilon)}{m_i} \end{align*}

This method can be shown to exactly preserve volume (the update for each variable is a "shear" transformation -- an update to a subset of variables that depends only on a different subset of variables -- which has Jacobian determinant one: for example, in the update for $p$, the new $p$ is just the old $p$ plus some stuff that depends on $q$, and the new $q$ is just the old $q$, so the diagonal of the Jacobian is all ones and one off-diagonal block is zero). It therefore avoids the divergence that comes from volumes expanding to infinity or contracting to zero (i.e., spiraling into the origin). However, it is not reversible: if we end up at state $(q_1, p_1)$, and try to take a step with the reverse momentum $-p_1$, the first thing we do is update the momentum based on the gradient at $q_1$, and now our next move on $q$ will follow that new momentum instead of the one that got us from $q_0$ to $q_1$, so we won't end up back at $q_0$. You might try to fix this by considering the opposite procedure (the one which updates $q$, then $p$), but this has the same problem just pushed back one step: starting at $q_2$ we will take the correct reverse move to $q_1$, but now we will update $-p_2$ using the gradient at $q_1$, when the actual update to go from $-p_2$ to $-p_1$ would have used the gradient at $q_2$.

To preserve reversibility, we make one final tweak to yield the leapfrog method:

\begin{align*} p_i(t + \epsilon/2) &= p_i(t) - (\epsilon/2) \frac{\partial U}{\partial q_i} (q(t))\\ q_i(t + \epsilon) &= q_i(t) + \epsilon \frac{p_i (t+\epsilon/2)}{m_i}\\ p_i(t+\epsilon) &= p_i(t + \epsilon/2) - (\epsilon/2) \frac{\partial U}{\partial q_i} (q(t+\epsilon)) \end{align*}

These are all still shear transformations, so the leapfrog preserves volume. Even better, the leapfrog is reversible: if we negate $p$ at the end of the trajectory and apply the updates again, we'll end up back where we started (after negating $p$ a final time). This follows just from the symmetry of the update rules. Essentially we're updating the momentum based on the halfway points of our jumps, rather than at the "start" or the "end", and flipping the start and end of the jump doesn't change the halfway point.

It can be shown that the leapfrog introduces single-step error of order $\epsilon^3$, vs $\epsilon^2$ for the two non-reversible methods.
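
A sketch of the leapfrog integrator (scalar mass), plus a check of the reversibility property on a quadratic potential:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L, m=1.0):
    """Run L leapfrog steps of Hamiltonian dynamics with scalar mass m."""
    q, p = np.copy(q), np.copy(p)
    p -= 0.5 * eps * grad_U(q)        # initial half step on momentum
    for _ in range(L - 1):
        q += eps * p / m              # full step on position
        p -= eps * grad_U(q)          # full step on momentum
    q += eps * p / m
    p -= 0.5 * eps * grad_U(q)        # final half step on momentum
    return q, p

# Reversibility check on the quadratic potential U(q) = q^2 / 2:
grad_U = lambda q: q
q1, p1 = leapfrog(np.array([1.0]), np.array([0.5]), grad_U, 0.1, 25)
q0, p0 = leapfrog(q1, -p1, grad_U, 0.1, 25)
print(q0, -p0)   # recovers the initial state (1.0, 0.5) up to rounding
```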

MCMC with Hamiltonian Dynamics

In general, given an energy function $E(x)$, we can define the canonical distribution over states

P(x) = \frac{1}{Z}\exp(-E(x)/T)

for some temperature $T$ and normalizing constant $Z$. The Hamiltonian is an energy function, so

P(q,p) = \frac{1}{Z}\exp(-H(q,p)/T)

defines a joint distribution on $q$ and $p$. Since the Hamiltonian is invariant under Hamiltonian dynamics, energy will be conserved, i.e. a perfectly simulated Hamiltonian trajectory will move within a surface of constant probability density in the joint space. Note that this joint density will factor into two canonical distributions: the target density $\pi(q)$ and a distribution on the momentum $p$. Using the standard quadratic kinetic energy, the distribution on $p$ is a zero-mean Gaussian with covariance $M$, though other choices are possible.

The standard Hamiltonian MCMC algorithm is:

  1. Resample the momentum $p$ from its Gaussian prior.
  2. Run some number $L$ of leapfrog steps with step size $\epsilon$ to propose a new state $(p^*, q^*)$.
  3. Accept or reject this proposal with a standard Metropolis-Hastings update. By the reversibility and volume-preserving properties of leapfrog dynamics, the proposal is symmetric, so the acceptance probability reduces to the Metropolis ratio
    \min\left(1, \frac{P(p^*,q^*)}{P(p,q)}\right) = \min\left(1, \frac{\pi(q^*)N(p^*; 0, M)}{\pi(q)N(p; 0, M)}\right)
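
Putting the pieces together, a minimal HMC sketch with identity mass matrix, fixed $\epsilon$ and $L$, and the leapfrog inlined (no adaptation; a sketch, not a robust implementation):

```python
import numpy as np

def hmc(log_pi, grad_log_pi, q0, n_samples, eps=0.1, L=20, seed=0):
    rng = np.random.default_rng(seed)
    q = np.array(q0, dtype=float)
    out = np.empty((n_samples, q.size))
    for t in range(n_samples):
        p = rng.normal(size=q.size)            # 1. resample momentum
        qn, pn = q.copy(), p.copy()
        pn += 0.5 * eps * grad_log_pi(qn)      # 2. leapfrog (grad U = -grad log pi)
        for i in range(L):
            qn += eps * pn
            g = grad_log_pi(qn)
            pn += eps * g if i < L - 1 else 0.5 * eps * g
        # 3. MH accept/reject on the joint density exp(log pi(q) - |p|^2 / 2)
        log_accept = (log_pi(qn) - 0.5 * pn @ pn) - (log_pi(q) - 0.5 * p @ p)
        if np.log(rng.random()) < log_accept:
            q = qn
        out[t] = q
    return out

# e.g. sampling a 2-D standard Gaussian:
samples = hmc(lambda q: -0.5 * q @ q, lambda q: -q, np.zeros(2), 5000)
```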

The first step, resampling the momentum, is valid because the target density of $p$ is just its prior, so resampling from the prior preserves stationarity, and $p$ and $q$ are independent. But why do we need to do this? First, there is the nice property that we don't actually need to store the momentum variables between steps, but this is incidental. The real reason is that the Hamiltonian, thus the joint density, is (approximately) preserved under (approximate) Hamiltonian dynamics, so we need an additional step to allow us to jump to different levels of the joint density. In particular, if we didn't do the resampling, the potential energy $U(q)$ could never exceed the initial total energy $H(q,p)$, so we would never be able to explore regions of low probability density (high potential energy).

The second step, running the Hamiltonian dynamics using the leapfrog integrator, is just an arbitrary deterministic function of the current state: the fact that it is inspired by physics is irrelevant to the structure of the MCMC algorithm. If we were running true Hamiltonian dynamics, then the Hamiltonian and thus the joint density would be preserved by this step, and the MH acceptance probability would always be 1. However, the leapfrog discretization introduces changes in the Hamiltonian, so we need to include the MH step. Note that if we wanted to use a non-reversible set of discrete dynamics (e.g. Euler's method), we would additionally need to compute the forward and backward proposal densities, which is tricky since it involves integrating over all possible momenta you could have resampled.

Note that the momentum resampling would be valid as its own step, but the second component (Hamiltonian dynamics) would not, because the overall symmetry of the proposal depends on the fact that $N(p; 0, M) = N(-p; 0, M)$, i.e. the ability to reverse the trajectory by negating the momentum. The Hamiltonian move by itself is deterministic, and not its own inverse, so the probability of returning to the initial state under it is zero unless we include the momentum resampling.

The previous arguments combine to motivate the algorithm and show that it leaves the canonical distribution invariant. Of course, the null algorithm also leaves the canonical distribution invariant. For HMC to be valuable, it needs to actually mix to the canonical distribution, i.e. the chain needs to be ergodic. This is apparently implied in general by the fact that the momentum distribution is supported everywhere, and arbitrary values of momentum can move us to arbitrary points in the space (not obvious to me why this is true?). But there's a problem in practice if our step size is such that the discrete dynamics exhibits periodic behavior; one solution is to sample the step size randomly from some small interval.

Alternate Dynamics

Since the Hamiltonian dynamics are ultimately just an MH proposal, we can introduce lots of approximations, as long as the proposal remains symmetric and volume-preserving. For example, we could run the leapfrog on an approximate Hamiltonian, e.g. a GP approximation to the true density. Or we can impose parameter constraints by including "barriers" that we bounce off of. Several other possibilities are described below.

Windowed HMC

Plotting the value of $H$ over the course of a leapfrog trajectory, we might find that it oscillates around the original value, and our acceptance probability might be low or high depending on where exactly this oscillation is when the trajectory ends.

Windowed HMC exploits this by operating on sequences of states instead of just single states. For a given state $(q_0,p_0)$, we can imagine the infinite sequence of states $\ldots, (q_{-1}, p_{-1}), (q_0, p_0), (q_1, p_1), \ldots$ defined by forward and backward leapfrog steps from $(q_0, p_0)$. More concretely, imagine the set of finite length-$W$ substrings of this sequence that contain $(q_0,p_0)$. Windowed HMC corresponds to mapping from $(q_0, p_0)$ to a uniformly chosen member of that set, performing leapfrog steps to extend the trajectory forwards to have length $L$ (where we'd generally choose $L \ge 2W$), then proposing the final $W$ steps of the resulting trajectory as the "new state" in the higher dimensional space. This proposal is accepted or rejected by MH using the density induced on these length-$W$ sequences by the mapping we described. The result will be a length-$W$ sequence, chosen from either the beginning or end of the trajectory depending on whether the move was accepted or not. We finally map this down to a single state by sampling from the states in this sequence with probabilities proportional to their densities under the canonical distribution. This overall step can be shown to leave the canonical distribution invariant.

In general, windowed HMC has lower rejection rates because the sequences effectively "smooth" over their components, avoiding the oscillation in values of $H$. Apparently this can help a lot in practice. An additional nice property is that, even when a move is 'rejected' in the high-dimensional space, we end up resampling a new base state from a window surrounding the start state, so the chain still progresses.

As an implementation trick, instead of storing all $2W$ states from the accept and reject windows, we can sample a single "accept state" and "reject state" in an online fashion as we're computing the relevant portions of the trajectories, by keeping track of a candidate $(q,p)$ and replacing it with a candidate $(q_i, p_i)$ with probability equal to $p(q_i, p_i)/s_{i+1}$, where $s_{i+1} = s_i + p(q_i, p_i)$ is recursively defined as the sum of densities at the points generated thus far.
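
This trick is just weighted reservoir sampling; a sketch, with `density` standing in for the canonical density $p(q, p)$:

```python
import numpy as np

def pick_window_state(states, density, rng):
    """Choose one state from a window with probability proportional to its
    density, in a single pass (weighted reservoir sampling)."""
    chosen, s = None, 0.0
    for state in states:            # states can be a generator
        w = density(state)
        s += w                      # s_{i+1} = s_i + p(q_i, p_i)
        if rng.random() < w / s:
            chosen = state          # replace candidate with prob w / s_{i+1}
    return chosen
```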

NUTS

(ref: http://arxiv.org/abs/1111.4246)

Suppose that the goal of an HMC step is to travel as far as possible from the initial point $x$. We can analyze this using the time derivative of the squared distance $(\tilde{x} - x)^2$:

\frac{d}{dt} (\tilde{x} - x)^2 \propto (\tilde{x} - x) \cdot \frac{d\tilde{x}}{dt} = (\tilde{x} - x) \cdot \tilde{r}

which we see is proportional to the dot product of $\tilde{x} - x$, the vector from our initial position to the current position, with the momentum $\tilde{r}$. If this quantity becomes negative, we are moving back towards the initial position, thus making negative progress. So the naive thing to try would be to run the leapfrog dynamics until this quantity becomes negative. Unfortunately this is not reversible; the NUTS sampler is essentially a trick to construct a reversible move from this naive idea. It seems a bit complicated and involves building trees and such.

There's a separate issue of choosing the optimal step size $\epsilon$. One approach is Robbins-Monro optimization: whenever a move is rejected, decrease the step size, and increase the step size when a move is accepted, using a decreasing learning rate $\eta_t$ such that $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. The problem is that the decreasing learning rate does most of its adaptation early on, when the optimal step size during the burn-in phase might actually be quite different from the optimal value at convergence. Instead, NUTS uses a dual averaging update that (apparently) avoids this issue. It's possible to separate this from the no-U-turn approach to setting trajectory lengths, i.e., we can run HMC with fixed $L$ while using dual averaging to adapt $\epsilon$, or we can run the no-U-turn sampler with a fixed $\epsilon$.
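
A sketch of Robbins-Monro style adaptation of $\log \epsilon$ toward a target acceptance rate (the exponent and the 0.65 target are plausible-but-arbitrary choices here, and this is not NUTS's dual averaging):

```python
import numpy as np

def adapt_step_size(accepts, eps0=0.1, target=0.65):
    """Robbins-Monro adaptation of log(eps) toward a target acceptance rate.
    accepts: iterable of acceptance indicators (or probabilities).
    eta_t = t^(-0.6) satisfies sum eta_t = inf and sum eta_t^2 < inf."""
    log_eps = np.log(eps0)
    for t, a in enumerate(accepts, start=1):
        log_eps += t ** -0.6 * (a - target)   # accept -> grow, reject -> shrink
    return np.exp(log_eps)
```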

Short-cut trajectories

Instead of adapting the step-size parameters, we could just propose them from a very broad distribution (e.g. log-normal with high variance) and then set up the steps in such a way that the step takes very little time for a bad value of $\epsilon$. This way the parameter will automatically "adapt" (in a computational sense) to an appropriate value for different parts of the state space, without our having to do any explicit adaptation.

One way to do this is through trajectory reversal. We can group our $L$ leapfrog steps into blocks of size $k$, and after each block, test to see if the standard deviation of $H$ values for that block is within some acceptable window (too high means the stepsize is too large, too low means the opposite). If the test fails, we reverse the trajectory, moving back towards the initial state (by reversibility, these steps are all known and don't need to be computed!) and then off in the opposite direction. If another block of $k$ steps fails the test, we reverse again, and at this point all steps in the trajectory have been computed (we will just bounce back and forth between the reversal points until we've taken $L$ steps)! So we've escaped from the bad stepsize at greatly reduced computational cost. By contrast, if the stepsize is good, we will likely never reverse and thus will end up in the same place as always.

For this approach to be valid, the test for a block has to give the same result when applied to the reversed block. If this condition holds, we can see that whatever state we end in, if we undertake the reverse trajectory, then any block that caused us to reverse in the forwards trajectory will again cause us to reverse, so again the dynamics are fully reversed and we'll end up in the start state. Since this trick preserves reversibility, it is a valid symmetric Metropolis proposal.

Riemannian Manifold HMCMC

automatically adapt to correlation structure in high dimensions?

http://www.dcs.gla.ac.uk/inference/rmhmc/

http://arxiv.org/abs/1304.1920

Stochastic Gradient HMCMC

http://arxiv.org/pdf/1402.4102.pdf

Affine Invariant MCMC

(ref: Goodman, Weare, "Ensemble Samplers with Affine Invariance", 2010)

Many MCMC algorithms are very sensitive to affine transformations, i.e., they have a much harder time sampling from an anisotropic Gaussian than from an isotropic Gaussian, since the maximum step size is essentially limited by the standard deviation along the most constrained direction. For example, random-walk MH, Gibbs, and HMC all have this drawback (MH and HMC are rotation invariant, so they are not fazed by high correlations between variables, but they are still sensitive to variables of wildly varying scales). By contrast, an affine-invariant sampler is one in which the samples generated are unchanged if an affine transformation is applied to the space. That is, if we define a Markov chain by $x^{(t+1)} = R(x^{(t)}, \xi, \pi)$, where $\pi$ is the target density, $\xi$ is a string of random bits, and $R$ is a deterministic function defining the dynamics of the chain, then an affine-invariant chain has the property that

R(Ax + b, \xi, \pi_{A,b}) = AR(x, \xi, \pi) + b

for any affine transformation $Ax+b$, where $\pi_{A,b}$ denotes the correspondingly transformed density $\pi_{A,b}(y) \propto \pi(A^{-1}(y - b))$. Goodman and Weare propose a method to achieve this property using an ensemble sampler, i.e. a chain that tracks multiple "walkers" $x_k^{(t)}$ in parallel, so that the stationary distribution of each walker is the target distribution (though the walkers do not have to move independently, allowing the trajectories of individual walkers to be non-Markov since historical information regarding the trajectory of walker $x_i$ can be encoded in the current positions of the other walkers). The approach is based on the Nelder-Mead simplex method for deterministic optimization, which is also affine invariant.

The essential component is the "stretch move", which updates one ensemble component $X_k$ conditioned on some other component $X_j$, by the move

Y = X_j + Z(X_k^{(t)} - X_j)

where $Z$ is sampled from some density $g$ satisfying the condition $g(1/z) = zg(z)$. One possible candidate is $g(z) \propto 1/\sqrt{z}$ for $z \in [1/a, a]$, and $g(z)=0$ otherwise, where $a>1$ is a tuning parameter. Note that this move is affine invariant since it is just an affine transformation of $X_k$ and $X_j$. The acceptance ratio is

\min\left(1, Z^{n-1} \frac{\pi(y)}{\pi(x_k^{(t)})}\right)

where $n$ is the dimension of the space, which I don't understand how to derive :-(. To run the general algorithm, cycle through the walkers, and for each walker $X_k$, choose a random $X_j$ to apply the stretch move. The updates can also be done in parallel: split the walkers into two partitions $S^{(0)}$ and $S^{(1)}$, then alternate updating $S^{(0)}$ in parallel given $S^{(1)}$, then the reverse. The easy parallelization of ensemble samplers can be very valuable when drawing lots of samples from an expensive-to-evaluate likelihood, e.g., in large scientific applications.
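
A sketch of one serial sweep of stretch moves over an ensemble (the inverse-CDF formula for $Z$ follows from $g(z) \propto 1/\sqrt{z}$ on $[1/a, a]$):

```python
import numpy as np

def stretch_sweep(walkers, log_pi, a=2.0, rng=None):
    """One sweep of Goodman & Weare stretch moves. walkers: array (K, n)."""
    rng = rng or np.random.default_rng()
    K, n = walkers.shape
    for k in range(K):
        j = rng.choice([i for i in range(K) if i != k])   # another walker
        z = (1.0 + (a - 1.0) * rng.random()) ** 2 / a     # Z ~ g, inverse CDF
        y = walkers[j] + z * (walkers[k] - walkers[j])    # stretch move
        log_r = (n - 1) * np.log(z) + log_pi(y) - log_pi(walkers[k])
        if np.log(rng.random()) < log_r:
            walkers[k] = y
    return walkers
```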

Another affine-invariant move is the "walk move", in which we choose a small set $S$ of walkers (say, of size three) independently of $X_k$, then propose from a Gaussian centered at $X_k$ with covariance given by the empirical covariance of $S$. Empirically, this seems not to work as well; one possible explanation is that a good proposal requires randomly selecting three walkers close to $X_k$, while a good proposal in the stretch move only requires selecting one such walker.

Parallel Tempering

Rao-Blackwellization

Reversible Jump MCMC

(most of this taken from David Hastie and Peter Green, "Model Choice using Reversible Jump Markov Chain Monte Carlo")

Reversible jump MCMC deals with the problem of sampling from a space with varying number of dimensions. This includes model choice problems, e.g. selecting between polynomial models of different degrees, as well as situations in which we have a single model with a variable-dimension parameter, e.g. number of cluster components.

General view of MH

Suppose we represent our state as a vector $x$. A proposed next state can be given by a deterministic proposal function $(x', u') = h(x, u)$, where $u \sim g$ is a vector of random numbers sampled from some known density $g$. Here $x'$ is a new state and $u'$ represents any other information we need to keep around in order for $h$ to be an invertible function. Let $h'$ be the inverse of $h$, and note that a reverse move $(x,u) = h'(x', u')$ will have to sample $u'$ from some distribution $g'$ (this could be the distribution on $u'$ obtained by sampling $u$ from $g$, then passing through $h(x,u)$ for some particular value of $x$, but in general it doesn't have to be). Let $\alpha(x, x')$ be the probability that the move from $x$ to $x'$ is accepted. Then we can write the detailed balance requirement as

\int_{(x, x') \in A \times B}\pi(x)g(u)\alpha(x, x')\,dx\,du = \int_{(x, x') \in A \times B} \pi(x')g'(u')\alpha(x', x)\,dx'\,du'

where AA and BB are any two arbitrary regions of the state space. This is the continuous generalization of the detailed balance condition given earlier, where we now require that probabilities of transition between any two regions of the state space must be reversible. We can substitute variables on the right side to arrive at

\int_{(x, x') \in A \times B}\pi(x)g(u)\alpha(x, x')\,dx\,du = \int_{(x, x') \in A \times B} \pi(x')g'(u')\alpha(x', x)\left|\frac{\partial(x', u')}{\partial(x,u)}\right| dx\,du

where we now take $(x', u') = h(x,u)$, and the term $\left|\frac{\partial(x', u')}{\partial(x,u)}\right|$ is the Jacobian determinant introduced by the change of variables. (Intuitively, you can think of the Jacobian as accounting for things like changes of units: a uniform distribution on distances between 0 and 1 kilometer has constant density 1.0 on that range, while a uniform distribution on 0 to 1000 meters has density 0.001 even though it's exactly the same distribution. So if we're switching from a state representation $x$ measured in kilometers to a representation $x'$ measured in meters (or a higher-dimensional representation denominated in, say, km$^2$), then we have to adjust the densities accordingly.) Note that a sufficient condition for this equality to hold is if we have

\pi(x)g(u)\alpha(x, x') = \pi(x')g'(u')\alpha(x', x)\left|\frac{\partial(x', u')}{\partial(x,u)}\right|

for all $x$ and $u$. Now if we choose an acceptance probability

\alpha(x, x') = \min\left\{1, \frac{\pi(x')g'(u')}{\pi(x)g(u)} \left|\frac{\partial(x', u')}{\partial(x,u)}\right|\right\}

then, substituting this back in we have

\begin{align*} \pi(x)g(u)\alpha(x, x') &= \pi(x)g(u)\min\left\{1, \frac{\pi(x')g'(u')}{\pi(x)g(u)} \left|\frac{\partial(x', u')}{\partial(x,u)}\right|\right\}\\ &=\min\left\{ \pi(x)g(u), \pi(x')g'(u') \left|\frac{\partial(x', u')}{\partial(x,u)}\right| \right\}\\ &= \min\left\{\pi(x)g(u) \left|\frac{\partial(x, u)}{\partial(x',u')}\right|, \pi(x')g'(u')\right\} \left|\frac{\partial(x', u')}{\partial(x,u)}\right|\\ &= \pi(x')g'(u') \min\left\{\frac{\pi(x)g(u)}{\pi(x')g'(u')} \left|\frac{\partial(x, u)}{\partial(x',u')}\right|, 1\right\} \left|\frac{\partial(x', u')}{\partial(x,u)}\right|\\ &= \pi(x')g'(u')\alpha(x', x)\left|\frac{\partial(x', u')}{\partial(x,u)}\right|, \end{align*}

where the third line uses the property that the Jacobian matrix of $h'$, the inverse transformation, is the inverse of the Jacobian of $h$. This proves that the choice of $\alpha$ given above yields detailed balance.
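
A concrete toy instance of the Jacobian correction: a multiplicative random walk $h(x, u) = (x e^u, -u)$ with $u \sim N(0, \sigma^2)$, for which $g' = g$ by symmetry and the Jacobian determinant is $e^u$ (the Exp(1) target is just for illustration):

```python
import numpy as np

def log_pi(x):
    # Unnormalized log density of Exp(1) on x > 0.
    return -x if x > 0 else -np.inf

def mult_rw_step(x, sigma=0.5, rng=None):
    """Multiplicative random walk: h(x, u) = (x * e^u, -u), u ~ N(0, sigma^2).
    The Jacobian of h has |det| = e^u, so the acceptance ratio picks up an
    extra factor e^u (equivalently, x'/x). Started at x > 0, the chain stays
    positive, since the move just rescales x."""
    rng = rng or np.random.default_rng()
    u = sigma * rng.normal()
    x_new = x * np.exp(u)
    log_alpha = log_pi(x_new) - log_pi(x) + u   # + log |Jacobian|
    return x_new if np.log(rng.random()) < log_alpha else x
```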

To connect this approach to traditional MH notation, which uses a proposal density $q$, we note that we can construct such a density just by integrating over $u$:

q(x' | x) = \int_{\{u \,:\, h(x,u) \in \{x'\} \times \mathbb{R}^{r'}\}} g(u)\,du

(interpreted as a density over $x'$, i.e., the pushforward of $g$ through the first component of $h$).

Multiple move types

If we have multiple types of moves, we can allow the move taken to depend on the current state $x$. Formally, let $M$ be a countable set of move types indexed by $m$, where we let $j_m(x)$ indicate the probability of applying move type $m$ in state $x$, and each move is formally defined by some distribution $g_m(u)$ and proposal function $h_m(x,u)$. Now the probability of moving from region $A$ to region $B$ of the state space is given by

\sum_m \int_{(x, x') \in A \times B}\pi(x)g_m(u)j_m(x)\alpha_m(x, x')\,dx\,du

and thus the detailed balance requirement is

\sum_m \int_{(x, x') \in A \times B}\pi(x)g_m(u)j_m(x)\alpha_m(x, x')\,dx\,du = \sum_m \int_{(x, x') \in A \times B}\pi(x')g'_m(u')j_m(x')\alpha_m(x', x)\,dx'\,du'.

It's clearly sufficient to show that detailed balance holds for each individual move type, which we can do by choosing the appropriate acceptance probability, similarly to above but now incorporating the move type probability:

\alpha_m(x, x') = \min\left\{1, \frac{\pi(x')j_m(x')g'_m(u')}{\pi(x)j_m(x)g_m(u)} \left|\frac{\partial(x', u')}{\partial(x,u)}\right|\right\}.

Note that here the Jacobian term is with regard to $h_m(x,u)$ and thus is implicitly specific to the move type in question. Also note that the move probabilities $j_m$ do not necessarily have to sum to 1. If they don't, then the remaining probability mass corresponds to no move being attempted.

Note that this approach is equivalent to taking the move type as an auxiliary variable, as discussed above in the "Random Choice of Proposal" section (with the caveat that in the current exposition we're assuming that the state consists only of continuous variables). To see this, note that $j_m(x)$ is defined to mean $p(m|x)$, so we have

q(x'|x, m)j_m(x) = q(x'|x,m)p(m|x) = q(x',m|x)

and

\pi(x)j_m(x) = \pi(x)p(m|x) = p(x, m)

so both our proposals and acceptance probabilities can be written in terms of the joint state $(x,m)$. Alternately, we can think of this as an extension of MH: just as MH extends the Metropolis algorithm, allowing asymmetric proposals by explicitly incorporating the proposal probabilities into the acceptance rate, we can extend MH to allow asymmetric move probabilities, of the form $p(m | x) \ne p(m | x')$, using the same trick of just incorporating the probabilities into the acceptance rate in order to cancel out the asymmetry.

Discrete variables

If our state includes discrete variables, we can represent the transitions between these values as different moves. For a general discrete state space, we let $m_{k'}$ denote the move which sets the value to $k'$; then we can represent the proposal distribution $q(k'|k)$ by $j_{m_{k'}}(k) = q(k' | k)$. If our space also has a continuous component, then we can treat that component independently of the discrete component by taking the Cartesian product of the continuous moves we want with the discrete moves just defined. But this isn't necessary in general; we can define a joint proposal on the continuous and discrete components by defining a different set of continuous moves for each discrete move.

Changing dimension

TODO: add stuff about the $n + r = n' + r'$ requirement (the dimensions of $(x, u)$ and $(x', u')$ must match in order for $h$ to be a diffeomorphism).

Open-world Inference

A special case of interest is that of a world containing an unknown number of objects, each described by some $d$-dimensional parameter vector, so that a world with $k$ objects is described by $D = kd$ parameters. Here our state is given by a vector $x = [k, x_{11}, x_{12}, \ldots, x_{21}, \ldots, x_{kd}]$ describing a world of $k$ objects, and we want to construct a Markov chain that samples from a distribution $p(x)$.

A simple example of this type is a Gaussian mixture model in which we have an unknown number of mixture components, each parameterized by its mean location (we assume for simplicity that all components have the same weight and the same covariance matrix).

In an open-world scenario, we can imagine a few specific types of moves:

  • Birth/Death. A birth move creates a new object and initializes its parameters according to some random proposal. In the simplest form, we might think of g itself as being a proposal distribution for a new object, in which case the birth move (x', u') = h(x, u) looks like
    \begin{align*} x' &= [k+1, x_{11},\ldots, x_{kd}, u_1, \ldots, u_d]\\ u' &= []. \end{align*}
    However, in general we want our new object proposals to be allowed to depend on the current objects. To do this, we go back to thinking of g just as a source of random bits (or more realistically, samples from well-known distributions like a standard uniform, standard Gaussian, etc.), and move the hard work into the proposal function h.

The inverse of a birth move is a death move, which deletes an object. Here we encounter a subtlety: the vector representation used above confers a notion of ordering on the objects in the world, i.e., it is formally different to write [\text{obj}_1, \text{obj}_2] than to write [\text{obj}_2, \text{obj}_1]. The simple birth move above always adds the birthed object as the "rightmost" object, so its inverse would always kill the rightmost object. Although this is formally sufficient to maintain irreducibility -- we can always get from one state to another by killing all the objects of the first state, then birthing all the objects of the second -- we usually want to be able to kill arbitrary objects, not just the most recently created one. Thus our death move requires a choice of which object to kill.
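
Continuing the toy model above, a birth/death pair in which the death move can kill an arbitrary object might look like the sketch below. The 0.5/0.5 move split and the birth-proposal scale are arbitrary choices; since h is the identity on (x, u), the Jacobian is 1, and the birth's insertion-position probability 1/(k+1) exactly cancels the reverse death's 1/(k+1) choice of victim, so neither appears in the ratio.

```python
SIGMA_B = 5.0  # scale of the (illustrative) birth proposal for a new mean

def log_g(u):
    # log density of the birth proposal u ~ N(0, SIGMA_B^2)
    return -0.5 * (u / SIGMA_B) ** 2 - np.log(SIGMA_B * np.sqrt(2 * np.pi))

def birth_death_step(state, rng):
    k, means = state
    if rng.uniform() < 0.5:                     # attempt a birth
        u = rng.normal(0.0, SIGMA_B)
        i = rng.integers(k + 1)                 # where to insert the new object
        new = (k + 1, means[:i] + [u] + means[i:])
        log_a = log_pi(new) - log_pi(state) - log_g(u)
    else:                                       # attempt a death
        if k == 0:
            return state                        # nothing to kill: no move attempted
        i = rng.integers(k)                     # the choice of which object to kill
        u = means[i]
        new = (k - 1, means[:i] + means[i + 1:])
        log_a = log_pi(new) + log_g(u) - log_pi(state)
    return new if np.log(rng.uniform()) < min(0.0, log_a) else state
```

Interleaving this with ordinary within-dimension moves on the means would give a complete sampler for the toy model.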

warning: below this point is a bunch of rambling, which does resolve the issue in question, but needs a lot of cleanup to be presentable to anyone.

We can implement this formally using the notion of discrete states discussed above: for each integer k \ge 0, we define a birth/death move pair, in which the birth move inserts the new object at position k, and the death move kills the object at position k. In a state containing n objects, we define j_m such that all birth moves up to k = n inclusive have equal probability

okay, the confusing thing about the paper is that it doesn't say how to choose between applying forward and backward moves. it says a move takes us from (k, x) to (k', x'), and the backwards move does the opposite. one approach is to have formally different birth moves for going from 1 to 2 objects, 2 to 3 objects, etc. then part of our "k" is the number of objects currently in the world. so at any given point, most of our moves aren't applicable: only the ones which result in k objects can be inverted, and only the ones which start with k objects can be pushed forward.

this is essentially using the discrete side to do the bookkeeping. what about moves that don't change k? in a fixed-dimensional space…

sometimes we have a proposal whose inverse is the same as itself (e.g. a symmetric Gaussian move), so it doesn't matter how we choose. but now suppose we have a "move-right" proposal which samples from a positive Gaussian, and its "move-left" inverse. if we sample this move type, what happens? do we move left or right? one answer, which I think is formally consistent, is to define two versions of the move, one for the move-right (whose inverse is the move-left) and the other for the move-left (whose inverse is the move-right). Then the convention can be that if both the forward and inverse moves are possible, we just always use the forward move. This lets us set whatever probabilities we want inside of j_m. This is kind of an ugly hack, but I don't see another good way to do it… Actually, one easy fix is just to let j_m range over move directions as well as move types. This is equivalent to just adding another term that assigns probabilities to each move direction. In most cases only one move direction will have any probability, so this term essentially just disappears.

This really is intuitively weird though. The formalism explicitly includes the reverse move, so it doesn't make sense to just ignore it and say we always do forward moves (but there just happens to be another forward move identical to the current reverse move).

Of course, we can always just use standard MH moves when the discrete state is not changing. The whole reason to separate the forward and reverse moves is that they act on different domains. If they have the same domain, you can wrap them both up in a single proposal function q(x'|x), since this can just as easily be flipped to interrogate q(x|x').

One thing to keep in mind: at the bottom of all this is the construction of a Markov chain. So if I ever get really confused, just think of the chain we're constructing.

end rambling part 1

It's important to think about the interaction between the birth/death moves and other within-dimension moves. For example, if all birth moves set a particular parameter to exactly zero, then it will be impossible to kill any object for which that parameter is nonzero. A general principle is that the birth proposal should have support over the entire space of objects that might conceivably be reached through within-dimension moves, since otherwise it is possible to arrive at unkillable objects.

Another (hopefully obvious) point in constructing valid birth moves is that our proposals have to be probabilistic: we can't just propose something arbitrarily, but for whatever we propose we must be able to calculate the probability of having proposed it.

  • Split/Merge.

Auxiliary Variable methods

Note: this is following Storvik's "On the Flexibility of Metropolis–Hastings Acceptance Probabilities in Auxiliary Variable Proposal Generation".

We want to make an initial proposal q(x^*_1 | y), then run a few MH steps to fine-tune that proposal into our final proposal q(y^*|y). Unfortunately it's not obvious how to directly compute the density q(y^* | y). As an alternative, we can think of the intermediate states achieved by the MH steps as auxiliary variables x^*, so our proposal is really q(x^*, y^* | y) = q(x^*|y)q(y^*|x^*,y), where q(x^*|y) = q(x^*_1|y)\prod_{i=2}^t q_{MH}(x^*_{i} | x^*_{i-1}) and q_{MH} denotes an MH step.

Since we're including the x^* as auxiliary variables, we need a joint target distribution \pi(y^*, x^*). We define this as \pi(y^*)h(x^*|y^*), where h is some distribution we get to choose in order to maximize the acceptance rate. Note that this has marginal distribution \pi(y^*), so the y's we sample will in fact be correctly distributed. For the reverse move, we must use the same target distribution \pi(y)h(x|y), though we're free to choose a different proposal distribution q(x,y|x^*, y^*) if we like. This gives acceptance ratio

\min\left(1, \frac{\pi(y^*, x^*) q(x, y | x^*, y^*)}{\pi(y, x) q(x^*, y^* | x, y)}\right) = \min\left(1, \frac{\pi(y^*)h( x^* | y^*) q(x | x^*, y^*) q(y|x, x^*, y^*)}{\pi(y)h(x | y) q(x^* | x, y)q(y^*|x^*, x, y)}\right).

Note that, in this approach, the old values of the auxiliary variables x also appear in the acceptance ratio, so in principle we should store them. In practice, if it is possible to sample from h, then we can lazily generate x on demand, since the move that replaces a stored x with a new x \sim h(x|y) is a Gibbs move that always accepts. So a move in this framework -- the "proposition 3 framework" -- looks like: first propose an initial x_1^*, then run MH for a while to generate x^*_t, then propose y^* | x^*_t, then sample x \sim h(x|y), and use these sampled values in the acceptance ratio above.
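
In code, one prop-3 move might look like the sketch below. The objects q_x and q_y, and the functions log_h / sample_h, are hypothetical stand-ins with the obvious sample/logpdf semantics; the ratio is a literal transcription of the acceptance ratio above. (In practice h is chosen so that most of these log-density terms cancel; evaluating the density of the internal MH chain exactly as written would usually be intractable.)

```python
import numpy as np

def prop3_move(y, log_pi, log_h, sample_h, q_x, q_y, rng):
    x = sample_h(y, rng)           # lazily regenerate old auxiliaries: Gibbs draw from h(x|y)
    x_star = q_x.sample(x, y, rng)            # initial draw plus MH fine-tuning
    y_star = q_y.sample(x_star, x, y, rng)    # final proposal y* | x*
    log_num = (log_pi(y_star) + log_h(x_star, y_star)
               + q_x.logpdf(x, x_star, y_star)       # q(x  | x*, y*)
               + q_y.logpdf(y, x, x_star, y_star))   # q(y  | x, x*, y*)
    log_den = (log_pi(y) + log_h(x, y)
               + q_x.logpdf(x_star, x, y)            # q(x* | x, y)
               + q_y.logpdf(y_star, x_star, x, y))   # q(y* | x*, x, y)
    return y_star if np.log(rng.uniform()) < min(0.0, log_num - log_den) else y
```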

There is a slightly different approach, the "proposition 4 framework", which is a bit more flexible. Here we think of the joint state as (y, x^*), i.e., we pretend to have already generated the x^*'s we'll use for our next proposal, so the target density is \pi(y, x^*) = \pi(y)q(x^*|y), where q(x^*|y) includes MH steps as defined above. Of course we do not need to explicitly store x^*, since we can Gibbs-generate it on demand by sampling from q(x^*|y).

In this setup, our joint proposal is q(y^*, x | y, x^*) = q(y^*|y, x^*)h(x|y^*, y, x^*), i.e., we sample y^* given our (presampled) x^*, and then sample a new x from some distribution h. Note the difference in the role of h: here it is (formally speaking) part of the proposal, rather than part of the stationary distribution as above. The main practical difference is that now x \sim h can depend not just on y, but also on x^* and y^*. For example, we could force x = x^* by taking h = \delta(x-x^*).

Choices of h

Storvik considers functions h(x_{1:t} | y) of the form (with the conventions x_0 = y and x_{t+1} = the destination state)

h(x_{1:t} | x_{t+1}) = \prod_{i=1}^{s-1} q_i(x_i | x_{i-1}) \prod_{i=s}^t q_{i+1}(x_i | x_{i+1})

in which we assume that the q_i are MH moves for i > s, though the moves with i \le s may be arbitrary. This is equivalent to a hypothesis that the first s steps of x were generated by our forward proposal distribution, while the final t-s steps were generated by running MH from the destination y. (There is some subtlety here: our first x^* value is usually dependent on the previous state y, via q_1(x^*|y). In the prop 3 setup, we are allowed to let x^* depend on y^*, so we are okay if q_1(x^*|y)=q_1(x^*|y^*), that is, if the relevant information from y is still preserved in y^*; otherwise we may need to choose some other \tilde{q}_1, which will then not cancel out of the acceptance ratio. In the prop 4 setup, it seems like this shouldn't be an issue, but I am therefore a little confused about why Storvik bothers mentioning the issue with regard to prop 3.) The special case s=1 corresponds to assuming the entire x sequence was generated by MH from our destination, meaning that we can ignore the destination and just compute the weight function

w_1(y; x^*) = \frac{\pi(x_1^*)}{q_1(x_1^* | y)}

where the "weight function" is simply the component of the acceptance ratio coming from the forward move, with an analogous function for the backwards move; see the paper for more. This case says that we can essentially ignore the fact that we've run additional MH samples: we accept the final state yy^* of the MH sequence if and only if we would have accepted its start state x1x_1^*.

Another special case is s=t, which gives weight

w_t(y, x^*) = \frac{\pi(y^*)}{q_{t+1}(y^* | x_t^*)}.

Note that this requires us to be able to compute the density q_{t+1}, but does not depend at all on the intermediate states x_{1:t}; this is because for those variables we have defined h to equal the forward proposal q, so (under either the prop 3 or prop 4 setups) they are being sampled from their stationary distribution.

More generally, we can use linear combinations of weight functions. TODO: I don't really understand why, and the paper doesn't spell it out (except in the importance sampling section). This lets us use information from all of the MH samples, though the results in the paper don't seem to get much of a benefit from this. one answer: we get to choose h, so we can define it as a mixture distribution, which I think then just spills out into a mixture of weight functions if everything else is held constant.

Mismatched Proposals

Suppose we have two proposal distributions q_1 and q_2. You could think of these as "smart" and "dumb" reversible jump proposals, but in the simplest case they could even just be random walk proposals from Gaussians with different standard deviations, e.g., \sigma_1 = 1.0 and \sigma_2 = 10.0 respectively.

It's clearly valid to propose from the mixture proposal q(x'|x) = \frac{1}{2}(q_1(x'|x) + q_2(x'|x)), as long as we also compute the acceptance probability under the mixture. And of course it's valid to apply MH to each proposal individually, and then operate under the mixture kernel that first chooses a proposal distribution and then computes the acceptance ratio specific to that proposal.

But it's also valid to do something a bit trickier. Consider the "mismatched" acceptance ratios

\begin{align*} \alpha_{12}(x'|x) &= \min\left(1, \frac{\pi(x')q_2(x|x')}{\pi(x)q_1(x'|x)}\right)\\ \alpha_{21}(x'|x) &= \min\left(1, \frac{\pi(x')q_1(x|x')}{\pi(x)q_2(x'|x)}\right) \end{align*}

in which we propose from one distribution but compute the reverse probability under the other distribution. We can use these to construct MH transition kernels

\begin{align*} T_{12}(x'|x) &= q_1(x'|x)\alpha_{12}(x'|x) + \delta(x'=x)r_{12}(x)\\ T_{21}(x'|x) &= q_2(x'|x)\alpha_{21}(x'|x) + \delta(x'=x)r_{21}(x) \end{align*}

where r_{12} and r_{21} are rejection probabilities defined analogously to the MH rejection term above. These transition kernels will not individually satisfy detailed balance, but the mixture kernel,

T(x'|x) = \frac{1}{2}\left(T_{12}(x'|x) + T_{21}(x'|x)\right)

in which we choose at random which distribution to propose from, but then compute a "mismatched" acceptance probability, does in fact have detailed balance.

To show this, we need to prove

\pi(x')T(x|x') = \pi(x)T(x'|x)

or equivalently

\frac{T(x'|x)}{T(x|x')} = \frac{\pi(x')}{\pi(x)}

for all pairs (x, x'). This is trivially true when x = x', so we need only consider the case x \ne x'. Let

\begin{align*} a &= \frac{\pi(x')q_2(x|x')}{\pi(x)q_1(x'|x)}\\ a' &= \frac{\pi(x')q_1(x|x')}{\pi(x)q_2(x'|x)} \end{align*}

denote the inner parts of the acceptance ratios \alpha_{12}(x'|x) and \alpha_{21}(x'|x) respectively. Then assuming x \ne x' we have

\begin{align*} \frac{T(x'|x)}{T(x|x')} &= \frac{q_1 (x'|x)\alpha_{12}(x'|x) + q_2(x'|x) \alpha_{21}(x'|x) }{q_1(x|x') \alpha_{12}(x|x') + q_2(x|x') \alpha_{21}(x|x')}\\ &= \frac{q_1 (x'|x)\min(1, a) + q_2(x'|x) \min(1, a') }{q_1(x|x') \min(1, 1/a') + q_2(x|x') \min(1, 1/a)} \end{align*}

This expression is annoying due to all the mins, so we break it into four cases: (a \ge 1, a' \ge 1), (a \ge 1, a' < 1), (a < 1, a' \ge 1), and (a < 1, a' < 1). The third case follows from the second by exchanging the roles of q_1 and q_2 (which swaps a and a' but leaves the mixture kernel unchanged), and the fourth follows from the first by swapping x and x', so we need consider only the first two. In the first case (a \ge 1, a' \ge 1), we have

\begin{align*} \frac{T(x'|x)}{T(x|x')} &=\frac{q_1 (x'|x) + q_2(x'|x) }{q_1(x|x') \cdot 1/a' + q_2(x|x') \cdot 1/a}\\ &=\frac{q_1 (x'|x) + q_2(x'|x) }{\frac{\pi(x)}{\pi(x')}q_2(x'|x) + \frac{\pi(x)}{\pi(x')}q_1(x'|x) }\\ &= \frac{\pi(x')}{\pi(x)} \end{align*}

thus establishing detailed balance in this case. In the second case (a \ge 1, a' < 1), we have

\begin{align*} \frac{T(x'|x)}{T(x|x')} &=\frac{q_1 (x'|x) + q_2(x'|x) \cdot a' }{q_1(x|x') + q_2(x|x') \cdot 1/a}\\ &=\frac{q_1(x'|x) + \frac{\pi(x')}{\pi(x)} q_1(x|x') }{q_1(x|x') + \frac{\pi(x)}{\pi(x')}q_1(x'|x) }\\ &= \frac{\pi(x')}{\pi(x)} \left( \frac{\frac{\pi(x)}{\pi(x')} q_1(x'|x) + q_1(x|x') }{q_1(x|x') + \frac{\pi(x)}{\pi(x')}q_1(x'|x) } \right)\\ &= \frac{\pi(x')}{\pi(x)}, \end{align*}

and together with the previous case (and the symmetric cases) this establishes detailed balance for our mixture kernel.

This argument establishes the validity of a wide range of reversible-jump-flavored proposals without the need to split hairs regarding the domain of individual moves ("this q distribution is a birth on states with an odd number of objects but a death on states with an even number") or to rely on a degenerate mixture proposal (propose from a mixture of the birth and death moves, where for any pair of states one or the other will have zero probability since the dimensions won't match). It means we can feel free to smash together any two moves q_1 and q_2 as the "reversals" of each other, without having to contort ourselves into constructing any single proposal that actually reverses itself.

Note that for reversible jump moves it will often be the case that one of q_1(x'|x) and q_2(x'|x) is zero, for example, if q_1 is a birth move and q_2 is a death move. It is straightforward to check that the derivation goes through in this case.
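
To make the construction concrete, here is a minimal sketch using the two Gaussian random-walk proposals mentioned earlier (all names illustrative). If the reverse density is zero, np.log gives -inf and the proposal is simply rejected, consistent with the derivation.

```python
import numpy as np

class GaussianWalk:
    """Random-walk proposal x' = x + N(0, sigma^2)."""
    def __init__(self, sigma):
        self.sigma = sigma
    def sample(self, x, rng):
        return x + rng.normal(0.0, self.sigma)
    def logpdf(self, x_new, x):
        z = (x_new - x) / self.sigma
        return -0.5 * z * z - np.log(self.sigma * np.sqrt(2 * np.pi))

def mismatched_step(x, log_pi, q1, q2, rng):
    # One draw from T = (T12 + T21)/2: flip a fair coin for which proposal
    # to sample from, then score the reverse move under the *other* one.
    fwd, rev = (q1, q2) if rng.uniform() < 0.5 else (q2, q1)
    x_new = fwd.sample(x, rng)
    log_a = (log_pi(x_new) + rev.logpdf(x, x_new)
             - log_pi(x) - fwd.logpdf(x_new, x))
    return x_new if np.log(rng.uniform()) < min(0.0, log_a) else x

# e.g., with the scales from above:
# q1, q2 = GaussianWalk(1.0), GaussianWalk(10.0)
# rng = np.random.default_rng(0)
# x = 0.0
# for _ in range(10000):
#     x = mismatched_step(x, lambda t: -0.5 * t * t, q1, q2, rng)
```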

TODO: It feels like there should be a more "intuitive" explanation of this fact that doesn't rely on multiple cases of algebraic manipulations of transition probabilities.