minimum description length
Created: April 12, 2022
Modified: April 12, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Short descriptions of things, when they exist, must capture some kind of structure. The principle of Occam's razor posits that we should prefer simpler models; in description-length terms, models that compress the data better.

Two-part codes

In statistics and machine learning we think of data compression via a two-part code. Any hypothesis $\theta$ can represent the data $x$ using a description of length

$$L(x) = L(\theta) + L(x | \theta)$$

which sums the description length of the hypothesis, $L(\theta)$, with the 'leftover' information in the data given the hypothesis. In a probabilistic setting we can write this formally (applying the Kraft inequality) as

$$L_q(x) = \lceil-\log q(\theta=f(x))\rceil + \lceil-\log q(x | \theta=f(x))\rceil$$

where we require that our hypothesis $\theta$ is a unique function $f(x)$ of the data $x$ we are trying to represent (it can't depend on any 'extra' information). Both the hypothesis and the 'leftover' bits are encoded according to a distribution $q$ that we control. Note that it's sufficient to specify $q(x)$ since this induces the joint distribution $q(x, \theta) = q(x)\delta(\theta=f(x))$ pushing $x$ forward through $f$.
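A minimal sketch of the idea (a toy example of my own, not anything canonical): encode a binary string with a two-part code where the hypothesis is a Bernoulli rate restricted to a small grid, chosen deterministically as the grid point nearest the empirical frequency and coded uniformly. Structured data ends up with a description much shorter than its raw length.

```python
import numpy as np

def two_part_code_length(x, grid_size=17):
    """Bits to send binary data x as: hypothesis (a gridded Bernoulli rate) + residual."""
    x = np.asarray(x)
    grid = np.linspace(0.05, 0.95, grid_size)         # allowed hypotheses
    theta = grid[np.argmin(np.abs(grid - x.mean()))]  # deterministic choice theta = f(x)

    bits_hypothesis = np.ceil(np.log2(grid_size))     # uniform code over the grid
    k, n = x.sum(), len(x)
    nll = -(k * np.log2(theta) + (n - k) * np.log2(1 - theta))
    bits_residual = np.ceil(nll)                      # Shannon code length under Bernoulli(theta)
    return bits_hypothesis + bits_residual

rng = np.random.default_rng(0)
print(two_part_code_length(rng.random(1000) < 0.2))   # structured data: far fewer than 1000 bits
print(two_part_code_length(rng.random(1000) < 0.5))   # unstructured: slightly over 1000 bits
```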

Although we get to choose the coding distribution $q$, generally nature presents the data to us according to a distribution $p(x)$ that we do not control. The expected code length is therefore the cross-entropy

$$H(p, q) = -\mathbb{E}_p[\log q(x)].$$

We can now think about choosing the model $q$ that minimizes this quantity. Using the identity $H(p, q) = H(p) + KL(p\|q)$ it is clear that we should optimally pick $q(x)=p(x)$.
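A quick numerical check of that identity, with made-up distributions $p$ and $q$:

```python
import numpy as np

# Made-up discrete distributions: p is "nature", q is our coding distribution.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.4, 0.3, 0.2, 0.1])

cross_entropy = -np.sum(p * np.log2(q))          # expected code length under q
entropy       = -np.sum(p * np.log2(p))          # H(p)
kl            =  np.sum(p * np.log2(p / q))      # KL(p || q) >= 0

print(cross_entropy, entropy + kl)   # the identity: these match
print(entropy)                       # coding with q = p achieves the minimum, H(p)
```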

Bits-back coding

Instead of a deterministic code $\theta = f(x)$ we can use an arbitrary probabilistic code, $\theta\sim r(\cdot | x)$. Equivalently one could write this as $\theta = f(x, \epsilon)$ in which the hypothesis choice now depends on additional random bits $\epsilon$.

Encodings of $\theta$ and $x|\theta$ now transmit not just the information in $x$ but also the information used to select $\theta$ given $x$. And the receiver can in fact recover $\epsilon$: she first decodes $\theta$ and $x|\theta$ in turn, then follows the same process as the sender to construct the distribution $r(\theta | x)$, and finally decodes $\epsilon$ from $\theta$ under this distribution. (Equivalently, she can invert $\theta = f(x, \cdot)$, where the inverse is defined by the assumption that $\epsilon$ itself is non-redundant.) The amount of extra information transmitted is $-\log r(\theta | x)$, or on average the entropy $H(r(\theta | x))$.
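As a sanity check, here's the accounting in a toy discrete model (the numbers are made up, and I take $r(\theta|x)$ equal to the model posterior $q(\theta|x)$): averaged over many transmissions of the same $x$, the recovered bits come out to $H(r(\theta|x))$, and the net cost to $-\log_2 q(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (made-up numbers): 3 hypotheses, 4 possible data values.
q_theta = np.array([0.5, 0.3, 0.2])                      # q(theta)
q_x_given_theta = np.array([[0.70, 0.10, 0.10, 0.10],
                            [0.10, 0.60, 0.20, 0.10],
                            [0.25, 0.25, 0.25, 0.25]])   # q(x | theta)
q_joint = q_theta[:, None] * q_x_given_theta             # q(theta, x)
q_x = q_joint.sum(axis=0)                                # q(x)

x = 1                                  # the observed data value
r = q_joint[:, x] / q_x[x]             # hypothesis-choice distribution r(theta | x),
                                       # here chosen equal to the posterior q(theta | x)

# Simulate many transmissions of this same x and tally the bit accounting.
thetas = rng.choice(len(q_theta), size=100_000, p=r)
bits_sent      = -np.log2(q_theta[thetas]) - np.log2(q_x_given_theta[thetas, x])
bits_recovered = -np.log2(r[thetas])   # the receiver gets these back as epsilon
net_bits = bits_sent - bits_recovered

print(bits_recovered.mean(), -(r * np.log2(r)).sum())  # ~= H(r(theta | x))
print(net_bits.mean(), -np.log2(q_x[x]))               # net cost = -log2 q(x)
```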

To review the distributions operating here, we have:

  1. A data distribution $p(x)$ fixed by nature.
  2. A model $q(x, \theta) = q(\theta)q(x|\theta)$ used to encode and decode the hypothesis and data $(\theta, x)$.
  3. A 'hypothesis choice' distribution $r(\theta|x)$, used by the sender to choose which hypothesis to transmit, and by the receiver to decode the additional bits $\epsilon$.

We can view $r$ as an extension of the data distribution $p$: just as $p(x)$ represents the distribution under which $x$ is generated, $r(\theta|x)$ is (by construction) the distribution under which $\theta$ is generated, so $p(x)r(\theta|x)$ is the joint data-generating process. Similarly to the previous section, to minimize code length we would ideally choose $q(\theta, x) = p(x)r(\theta | x)$, implying

$$\begin{align*} q(x) &= p(x)\\ q(\theta | x) &= r(\theta | x) \end{align*}$$

Since $p(x)$ is fixed by nature, we really only have the freedom to choose $r$. This generalizes our ability in the previous section to choose the deterministic function $f$.
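To spell out why this choice of $q$ is optimal for a fixed $r$: the gross bits spent encoding $(\theta, x)$ are, in expectation, a cross-entropy over the joint, so the same decomposition as above gives

$$\mathbb{E}_{p(x)r(\theta|x)}\left[-\log q(\theta, x)\right] = H\big(p(x)r(\theta|x)\big) + D_\text{KL}\big(p(x)r(\theta|x), q(\theta, x)\big),$$

and the recovered bits $\mathbb{E}[-\log r(\theta|x)]$ don't involve $q$ at all, so the best we can do is drive the KL term to zero by setting $q(\theta, x) = p(x)r(\theta|x)$.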

To isolate the description length of $x$, we should subtract the recoverable information transmitted in $\epsilon$:

$$\begin{align*} L_{q, r}(x) &= \mathbb{E}_{r(\theta|x)} \left[-\log q(\theta) - \log q(x | \theta) + \log r(\theta | x)\right]\\ &= \mathbb{E}_{r(\theta|x)} \left[-\log q(\theta | x) - \log q(x) + \log r(\theta | x)\right]\\ &= D_\text{KL}(r(\theta|x), q(\theta|x)) - \log q(x) \end{align*}$$
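Numerically, with the same toy discrete model as above but an arbitrary (mismatched) $r$, the two expressions agree:

```python
import numpy as np

# Same toy model as in the sketch above, now with a mismatched r(theta | x).
q_theta = np.array([0.5, 0.3, 0.2])
q_x_given_theta = np.array([[0.70, 0.10, 0.10, 0.10],
                            [0.10, 0.60, 0.20, 0.10],
                            [0.25, 0.25, 0.25, 0.25]])
q_joint = q_theta[:, None] * q_x_given_theta
q_x = q_joint.sum(axis=0)

x = 1
q_post = q_joint[:, x] / q_x[x]        # q(theta | x)
r = np.array([0.2, 0.5, 0.3])          # some other hypothesis-choice distribution

# Left side: E_r[-log q(theta) - log q(x|theta) + log r(theta|x)].
lhs = np.sum(r * (-np.log2(q_theta) - np.log2(q_x_given_theta[:, x]) + np.log2(r)))
# Right side: KL(r, q(theta|x)) - log q(x).
rhs = np.sum(r * np.log2(r / q_post)) - np.log2(q_x[x])

print(lhs, rhs)   # equal
```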

Variational inference

Here I'll switch to more conventional notation. In Bayesian modeling we generally model the joint distribution $p(\theta, x)$ of latents and observed data. This model distribution $p$ corresponds to the coding distribution $q$ in the above, but we typically proceed as if our model is 'correct', assuming $q=p$. Rewriting the above in this notation:

$$L_{r}(x) = D_\text{KL}(r(\theta|x), p(\theta|x)) - \log p(x)$$

and rearranging:

$$\begin{align*} \log p(x) &= D_\text{KL}(r(\theta|x), p(\theta|x)) - L_{r}(x)\\ &\ge -L_r(x)\\ &= \mathbb{E}_{r(\theta|x)} \left[\log p(\theta, x) - \log r(\theta | x)\right] \end{align*}$$

we recover the standard evidence lower bound (ELBO). This justifies the interpretation of maximizing this bound (in variational inference) as minimizing the description length of the data.
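As a concrete check on the bound, here it is in a Beta-Bernoulli model (a standard conjugate example; the prior, data counts, and choice of variational family are mine), where the exact $\log p(x)$ is available in closed form: the bound is tight when $r$ is the exact posterior, and any other $r$ pays a KL penalty, i.e., a longer description.

```python
from scipy.special import betaln, digamma

# Beta-Bernoulli example (made-up numbers); units are nats, matching the formulas above.
a, b = 2.0, 2.0        # prior p(theta) = Beta(a, b)
n, k = 50, 12          # data x: n binary observations, k of them ones

def elbo(alpha, beta):
    """E_r[log p(theta, x) - log r(theta)] for r(theta | x) = Beta(alpha, beta)."""
    e_log_t   = digamma(alpha) - digamma(alpha + beta)   # E_r[log theta]
    e_log_1mt = digamma(beta)  - digamma(alpha + beta)   # E_r[log(1 - theta)]
    expected_loglik = k * e_log_t + (n - k) * e_log_1mt
    kl_r_prior = (betaln(a, b) - betaln(alpha, beta)
                  + (alpha - a) * digamma(alpha)
                  + (beta - b) * digamma(beta)
                  + (a + b - alpha - beta) * digamma(alpha + beta))
    return expected_loglik - kl_r_prior

log_evidence = betaln(a + k, b + n - k) - betaln(a, b)   # exact log p(x)

print(log_evidence)
print(elbo(a + k, b + n - k))   # r = exact posterior: the bound is tight
print(elbo(5.0, 5.0))           # any other r: strictly smaller (a longer description)
```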