minimum description length
Created: April 12, 2022
Modified: April 12, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Short descriptions of things, when they exist, must capture some kind of structure. The principle of Occam's razor posits that we should prefer simpler models; in description-length terms, models that compress the data better.

Two-part codes

In statistics and machine learning we think of data compression via a two-part code. Any hypothesis $\theta$ can represent the data $x$ using a description of length

$$L(x) = L(\theta) + L(x | \theta)$$

which sums the description length of the hypothesis, $L(\theta)$, with the 'leftover' information in the data given the hypothesis. In a probabilistic setting we can write this formally (applying the Kraft inequality) as

$$L_q(x) = \lceil-\log q(\theta=f(x))\rceil + \lceil-\log q(x | \theta=f(x))\rceil$$

where we require that our hypothesis $\theta$ is a unique function $f(x)$ of the data $x$ we are trying to represent (it can't depend on any 'extra' information). Both the hypothesis and the 'leftover' bits are encoded according to a distribution $q$ that we control. Note that it's sufficient to specify $q(x)$ since this induces the joint distribution $q(x, \theta) = q(x)\delta(\theta=f(x))$ pushing $x$ forward through $f$.
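A minimal sketch of the idea (a toy example of my own, not anything canonical): encode a binary string with a two-part code where the hypothesis is a Bernoulli rate restricted to a small grid, chosen deterministically as the grid point nearest the empirical frequency and coded uniformly. Structured data ends up with a description much shorter than its raw length.

```python
import numpy as np

def two_part_code_length(x, grid_size=17):
    """Bits to send binary data x as: hypothesis (a gridded Bernoulli rate) + residual."""
    x = np.asarray(x)
    grid = np.linspace(0.05, 0.95, grid_size)         # allowed hypotheses
    theta = grid[np.argmin(np.abs(grid - x.mean()))]  # deterministic choice theta = f(x)

    bits_hypothesis = np.ceil(np.log2(grid_size))     # uniform code over the grid
    k, n = x.sum(), len(x)
    nll = -(k * np.log2(theta) + (n - k) * np.log2(1 - theta))
    bits_residual = np.ceil(nll)                      # Shannon code length under Bernoulli(theta)
    return bits_hypothesis + bits_residual

rng = np.random.default_rng(0)
print(two_part_code_length(rng.random(1000) < 0.2))   # structured data: far fewer than 1000 bits
print(two_part_code_length(rng.random(1000) < 0.5))   # unstructured: slightly over 1000 bits
```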

Although we get to choose the coding distribution $q$, generally nature presents the data to us according to a distribution $p(x)$ that we do not control. The expected code length is therefore the cross-entropy

$$H(p, q) = -\mathbb{E}_p[\log q(x)].$$

We can now think about choosing the model $q$ that minimizes this quantity. Using the identity $H(p, q) = H(p) + KL(p\|q)$ it is clear that we should optimally pick $q(x)=p(x)$.
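A quick numerical check of that identity, with made-up distributions $p$ and $q$:

```python
import numpy as np

# Made-up discrete distributions: p is "nature", q is our coding distribution.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.4, 0.3, 0.2, 0.1])

cross_entropy = -np.sum(p * np.log2(q))          # expected code length under q
entropy       = -np.sum(p * np.log2(p))          # H(p)
kl            =  np.sum(p * np.log2(p / q))      # KL(p || q) >= 0

print(cross_entropy, entropy + kl)   # the identity: these match
print(entropy)                       # coding with q = p achieves the minimum, H(p)
```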

Bits-back coding

Instead of a deterministic code $\theta = f(x)$ we can use an arbitrary probabilistic code, $\theta\sim r(\cdot | x)$. Equivalently one could write this as $\theta = f(x, \epsilon)$ in which the hypothesis choice now depends on additional random bits $\epsilon$.

Encodings of $\theta$ and $x|\theta$ now transmit not just the information in $x$ but also the information used to select $\theta$ given $x$. And the receiver can in fact recover $\epsilon$: she first decodes $\theta$ and $x|\theta$ in turn, then follows the same process as the sender to construct the distribution $r(\theta | x)$, and finally decodes $\epsilon$ from $\theta$ under this distribution. (Equivalently, she can invert $\theta = f(x, \cdot)$, where the inverse is defined by the assumption that $\epsilon$ itself is non-redundant.) The amount of extra information transmitted is $-\log r(\theta | x)$, or on average the entropy $H(r(\theta | x))$.
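As a sanity check, here's the accounting in a toy discrete model (the numbers are made up, and I take $r(\theta|x)$ equal to the model posterior $q(\theta|x)$): averaged over many transmissions of the same $x$, the recovered bits come out to $H(r(\theta|x))$, and the net cost to $-\log_2 q(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (made-up numbers): 3 hypotheses, 4 possible data values.
q_theta = np.array([0.5, 0.3, 0.2])                      # q(theta)
q_x_given_theta = np.array([[0.70, 0.10, 0.10, 0.10],
                            [0.10, 0.60, 0.20, 0.10],
                            [0.25, 0.25, 0.25, 0.25]])   # q(x | theta)
q_joint = q_theta[:, None] * q_x_given_theta             # q(theta, x)
q_x = q_joint.sum(axis=0)                                # q(x)

x = 1                                  # the observed data value
r = q_joint[:, x] / q_x[x]             # hypothesis-choice distribution r(theta | x),
                                       # here chosen equal to the posterior q(theta | x)

# Simulate many transmissions of this same x and tally the bit accounting.
thetas = rng.choice(len(q_theta), size=100_000, p=r)
bits_sent      = -np.log2(q_theta[thetas]) - np.log2(q_x_given_theta[thetas, x])
bits_recovered = -np.log2(r[thetas])   # the receiver gets these back as epsilon
net_bits = bits_sent - bits_recovered

print(bits_recovered.mean(), -(r * np.log2(r)).sum())  # ~= H(r(theta | x))
print(net_bits.mean(), -np.log2(q_x[x]))               # net cost = -log2 q(x)
```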

To review the distributions operating here, we have:

  1. A data distribution $p(x)$ fixed by nature.
  2. A model $q(x, \theta) = q(\theta)q(x|\theta)$ used to encode and decode the hypothesis and data $(\theta, x)$.
  3. A 'hypothesis choice' distribution $r(\theta|x)$, used by the sender to choose which hypothesis to transmit, and by the receiver to decode the additional bits $\epsilon$.

We can view $r$ as an extension of the data distribution $p$: just as $p(x)$ represents the distribution under which $x$ is generated, $r(\theta|x)$ is (by construction) the distribution under which $\theta$ is generated, so $p(x)r(\theta|x)$ is the joint data-generating process. Similarly to the previous section, to minimize code length we would ideally choose $q(\theta, x) = p(x)r(\theta | x)$, implying

$$\begin{align*} q(x) &= p(x)\\ q(\theta | x) &= r(\theta | x) \end{align*}$$

Since $p(x)$ is fixed by nature, we really only have the freedom to choose $r$. This generalizes our ability in the previous section to choose the deterministic function $f$.
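To spell out why this choice of $q$ is optimal for a fixed $r$: the gross bits spent encoding $(\theta, x)$ are, in expectation, a cross-entropy over the joint, so the same decomposition as above gives

$$\mathbb{E}_{p(x)r(\theta|x)}\left[-\log q(\theta, x)\right] = H\big(p(x)r(\theta|x)\big) + D_\text{KL}\big(p(x)r(\theta|x), q(\theta, x)\big),$$

and the recovered bits $\mathbb{E}[-\log r(\theta|x)]$ don't involve $q$ at all, so the best we can do is drive the KL term to zero by setting $q(\theta, x) = p(x)r(\theta|x)$.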

To isolate the description length of $x$, we should subtract the recoverable information transmitted in $\epsilon$:

$$\begin{align*} L_{q, r}(x) &= \mathbb{E}_{r(\theta|x)} \left[-\log q(\theta) - \log q(x | \theta) + \log r(\theta | x)\right]\\ &= \mathbb{E}_{r(\theta|x)} \left[-\log q(\theta | x) - \log q(x) + \log r(\theta | x)\right]\\ &= D_\text{KL}(r(\theta|x), q(\theta|x)) - \log q(x) \end{align*}$$
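Numerically, with the same toy discrete model as above but an arbitrary (mismatched) $r$, the two expressions agree:

```python
import numpy as np

# Same toy model as in the sketch above, now with a mismatched r(theta | x).
q_theta = np.array([0.5, 0.3, 0.2])
q_x_given_theta = np.array([[0.70, 0.10, 0.10, 0.10],
                            [0.10, 0.60, 0.20, 0.10],
                            [0.25, 0.25, 0.25, 0.25]])
q_joint = q_theta[:, None] * q_x_given_theta
q_x = q_joint.sum(axis=0)

x = 1
q_post = q_joint[:, x] / q_x[x]        # q(theta | x)
r = np.array([0.2, 0.5, 0.3])          # some other hypothesis-choice distribution

# Left side: E_r[-log q(theta) - log q(x|theta) + log r(theta|x)].
lhs = np.sum(r * (-np.log2(q_theta) - np.log2(q_x_given_theta[:, x]) + np.log2(r)))
# Right side: KL(r, q(theta|x)) - log q(x).
rhs = np.sum(r * np.log2(r / q_post)) - np.log2(q_x[x])

print(lhs, rhs)   # equal
```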

Variational inference

Here I'll switch to more conventional notation. In Bayesian modeling we generally model the joint distribution $p(\theta, x)$ of latents and observed data. This model distribution $p$ corresponds to the coding distribution $q$ in the above, but we typically proceed as if our model is 'correct', assuming $q=p$. Rewriting the above in this notation:

$$L_{r}(x) = D_\text{KL}(r(\theta|x), p(\theta|x)) - \log p(x)$$

and rearranging:

$$\begin{align*} \log p(x) &= D_\text{KL}(r(\theta|x), p(\theta|x)) - L_{r}(x)\\ &\ge -L_r(x)\\ &= \mathbb{E}_{r(\theta|x)} \left[\log p(\theta, x) - \log r(\theta | x)\right] \end{align*}$$

we recover the standard evidence lower bound (ELBO). This justifies the interpretation of maximizing this bound (in variational inference) as minimizing the description length of the data.
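As a concrete check on the bound, here it is in a Beta-Bernoulli model (a standard conjugate example; the prior, data counts, and choice of variational family are mine), where the exact $\log p(x)$ is available in closed form: the bound is tight when $r$ is the exact posterior, and any other $r$ pays a KL penalty, i.e., a longer description.

```python
from scipy.special import betaln, digamma

# Beta-Bernoulli example (made-up numbers); units are nats, matching the formulas above.
a, b = 2.0, 2.0        # prior p(theta) = Beta(a, b)
n, k = 50, 12          # data x: n binary observations, k of them ones

def elbo(alpha, beta):
    """E_r[log p(theta, x) - log r(theta)] for r(theta | x) = Beta(alpha, beta)."""
    e_log_t   = digamma(alpha) - digamma(alpha + beta)   # E_r[log theta]
    e_log_1mt = digamma(beta)  - digamma(alpha + beta)   # E_r[log(1 - theta)]
    expected_loglik = k * e_log_t + (n - k) * e_log_1mt
    kl_r_prior = (betaln(a, b) - betaln(alpha, beta)
                  + (alpha - a) * digamma(alpha)
                  + (beta - b) * digamma(beta)
                  + (a + b - alpha - beta) * digamma(alpha + beta))
    return expected_loglik - kl_r_prior

log_evidence = betaln(a + k, b + n - k) - betaln(a, b)   # exact log p(x)

print(log_evidence)
print(elbo(a + k, b + n - k))   # r = exact posterior: the bound is tight
print(elbo(5.0, 5.0))           # any other r: strictly smaller (a longer description)
```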