Modified: April 12, 2022
minimum description length
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Short descriptions of things, when they exist, must capture some kind of structure. The principle of Occam's razor posits that we should prefer simpler explanations: models under which the data admit shorter descriptions.
Two-part codes
In statistics and machine learning we think of data compression via a two-part code. Any hypothesis $h$ can represent the data $x$ using a description of length
$$L(x) = L(h) + L(x \mid h),$$
which sums the description length of the hypothesis, $L(h)$, with the 'leftover' information in the data given the hypothesis. In a probabilistic setting we can write this formally (applying the Kraft inequality) as
$$L(x) = -\log q(h) - \log q(x \mid h),$$
where we require that our hypothesis $h = f(x)$ is a unique function of the data we are trying to represent (it can't depend on any 'extra' information). Both the hypothesis and the 'leftover' bits are encoded according to a distribution $q$ that we control. Note that it's sufficient to specify $q(x)$, since this induces the joint distribution $q(x, h)$ by pushing forward through $f$.
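As a sanity check, here is a small numerical sketch in Python (the alphabet, the function $f$, and the distribution $q(x)$ are all made up for illustration) of the point above: a deterministic $f$ plus a choice of $q(x)$ pins down the whole two-part code, and $-\log q(h) - \log q(x \mid h) = -\log q(x)$.

```python
import numpy as np

# Illustrative setup (not from the text): data alphabet x in {0, ..., 5},
# deterministic hypothesis h = f(x), and an arbitrary coding distribution q(x).
def f(x):
    return x // 3  # hypothesis: which "half" of the alphabet x falls in

q_x = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])

# Push q(x) forward through f to get q(h); condition to get q(x | h).
hs = np.array([f(x) for x in range(len(q_x))])
q_h = np.array([q_x[hs == h].sum() for h in range(hs.max() + 1)])

for x in range(len(q_x)):
    h = f(x)
    q_x_given_h = q_x[x] / q_h[h]
    two_part = -np.log2(q_h[h]) - np.log2(q_x_given_h)  # L(h) + L(x | h)
    assert np.isclose(two_part, -np.log2(q_x[x]))       # equals -log q(x)
```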
Although we get to choose the coding distribution $q$, generally nature presents the data to us according to a distribution $p(x)$ that we do not control. The expected code length is therefore the cross-entropy
$$\mathbb{E}_{x \sim p}\left[-\log q(x)\right] = H(p, q).$$
We can now think about choosing the model $q$ that minimizes this quantity. Using the identity $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ it is clear that we should optimally pick $q = p$.
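For concreteness, a tiny Python check of this identity (the distributions here are arbitrary examples):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # data distribution, fixed by nature
q = np.array([0.25, 0.25, 0.25, 0.25])   # a (suboptimal) coding distribution

H_p = -(p * np.log2(p)).sum()            # entropy H(p)
H_pq = -(p * np.log2(q)).sum()           # cross-entropy H(p, q): expected code length
kl = (p * np.log2(p / q)).sum()          # KL(p || q)

assert np.isclose(H_pq, H_p + kl)        # H(p, q) = H(p) + KL(p || q)
assert H_pq >= H_p                       # so any q != p codes the data less efficiently
```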
Bits-back coding
Instead of a deterministic code we can use an arbitrary probabilistic code, $h \sim r(h \mid x)$. Equivalently one could write this as $h = f(x, \epsilon)$, in which the hypothesis choice now depends on additional random bits $\epsilon$.
Encodings of $h$ and $x$ now transmit not just the information in $x$ but also the information used to select $h$ given $x$. And the receiver can in fact recover $\epsilon$, by first decoding $h$ and $x$ in turn, then following the same process as the sender to construct the distribution $r(h \mid x)$, and finally decoding $\epsilon$ from $h$ under this distribution. Equivalently, she can invert $h = f(x, \epsilon)$, where the inverse is defined by the assumption that $f$ itself is non-redundant. The amount of extra information transmitted is $-\log r(h \mid x)$, or on average the entropy $H(r(\cdot \mid x))$.
To review the distributions operating here, we have:
- A data distribution $p(x)$ fixed by nature.
- A model $q(x, h)$ used to encode and decode the hypothesis $h$ and data $x$.
- A 'hypothesis choice' distribution $r(h \mid x)$, used by the sender to choose which hypothesis $h$ to transmit, and by the receiver to decode the additional bits $\epsilon$.
We can view $r$ as an extension of the data distribution $p$: just as $p(x)$ represents the distribution under which $x$ is generated, $r(h \mid x)$ is (by construction) the distribution under which $h$ is generated, so $p(x)\, r(h \mid x)$ is the joint data-generating process. Similarly to the previous section, to minimize code length we would ideally choose $q(x, h) = p(x)\, r(h \mid x)$, implying an expected total code length
$$\mathbb{E}_{p(x)\, r(h \mid x)}\left[-\log q(x, h)\right] = H\big(p(x)\, r(h \mid x)\big) = H(p) + \mathbb{E}_{p}\left[H(r(\cdot \mid x))\right].$$
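A quick numerical check of this decomposition, using arbitrary finite distributions (the numbers are mine, purely for illustration):

```python
import numpy as np

p = np.array([0.6, 0.4])                 # p(x) over two data values
r = np.array([[0.9, 0.1],                # r(h | x = 0)
              [0.3, 0.7]])               # r(h | x = 1)

joint = p[:, None] * r                   # optimal coding distribution q(x, h) = p(x) r(h | x)
total = -(joint * np.log2(joint)).sum()  # expected total code length = joint entropy

H_p = -(p * np.log2(p)).sum()                    # entropy of the data
H_r = (p * -(r * np.log2(r)).sum(axis=1)).sum()  # average entropy of r(. | x)
assert np.isclose(total, H_p + H_r)
```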
Since $p(x)$ is fixed by nature, we really only have the freedom to choose $r(h \mid x)$. This generalizes our ability in the previous section to choose the deterministic function $f$.
To isolate the description length of $x$, we should subtract the recoverable information transmitted in $\epsilon$:
$$L(x) = \mathbb{E}_{r(h \mid x)}\left[-\log q(x, h)\right] - \mathbb{E}_{r(h \mid x)}\left[-\log r(h \mid x)\right] = \mathbb{E}_{r(h \mid x)}\left[-\log q(x, h) + \log r(h \mid x)\right].$$
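Here is a sketch of the bits-back accounting for a single observed $x$ in Python (the particular numbers are invented). It also checks the standard identity that this net length exceeds $-\log q(x)$ by exactly $D_{\mathrm{KL}}\big(r(\cdot \mid x) \,\|\, q(h \mid x)\big)$, which anticipates the next section:

```python
import numpy as np

# h ranges over {0, 1, 2}; all values below are illustrative.
q_xh = np.array([0.10, 0.05, 0.15])         # q(x, h) for this particular x, each h
q_x = q_xh.sum()                            # marginal q(x)
q_h_given_x = q_xh / q_x                    # model posterior q(h | x)

r = np.array([0.2, 0.3, 0.5])               # an arbitrary hypothesis-choice r(h | x)

gross = (r * -np.log2(q_xh)).sum()          # expected bits actually transmitted
refund = (r * -np.log2(r)).sum()            # expected bits the receiver can recover
net = gross - refund                        # E_r[-log q(x, h) + log r(h | x)]

kl = (r * np.log2(r / q_h_given_x)).sum()
assert np.isclose(net, -np.log2(q_x) + kl)  # net = -log q(x) + KL(r(.|x) || q(h|x))
```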
Variational inference
Here I'll switch to more conventional notation. In Bayesian modeling we generally model the joint distribution $p(x, z)$ of latents $z$ and observed data $x$. This model distribution corresponds to the coding distribution $q(x, h)$ in the above, but we typically proceed as if our model is 'correct', assuming that its marginal $p(x) = \int p(x, z)\, dz$ is the data distribution. Rewriting the above in this notation, with the hypothesis-choice distribution now written as the variational distribution $q(z \mid x)$:
$$L(x) = \mathbb{E}_{q(z \mid x)}\left[-\log p(x, z) + \log q(z \mid x)\right]$$
and rearranging:
$$-L(x) = \mathbb{E}_{q(z \mid x)}\left[\log p(x, z) - \log q(z \mid x)\right] = \mathrm{ELBO}(x) \le \log p(x),$$
we recover the standard evidence lower bound (ELBO). This justifies the interpretation of maximizing this bound (in variational inference) as minimizing the description length of the data.
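To make the bound concrete, here is a minimal Python sketch using a toy conjugate Gaussian model (my own example, not from the text): the Monte Carlo ELBO sits below the exact log evidence for a mismatched $q(z \mid x)$, and matches it when $q$ is the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(y, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

# Toy model: p(z) = N(0, 1), p(x | z) = N(z, 1),
# so p(x) = N(0, 2) and the exact posterior is p(z | x) = N(x/2, 1/2).
x = 1.3
log_px = log_normal(x, 0.0, 2.0)  # exact log evidence

def elbo(mu, var, n=100_000):
    """Monte Carlo estimate of E_{q(z|x)}[log p(x, z) - log q(z | x)]."""
    z = mu + np.sqrt(var) * rng.standard_normal(n)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_q = log_normal(z, mu, var)
    return np.mean(log_joint - log_q)

print(elbo(0.0, 1.0), "<=", log_px)   # loose bound for a poorly chosen q
print(elbo(x / 2, 0.5), "==", log_px) # tight (zero-variance) when q is the posterior
```

In the bits-back reading, the gap $\log p(x) - \mathrm{ELBO}(x) = D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z \mid x)\big)$ is exactly the extra description length paid for choosing hypotheses from $q$ rather than from the true posterior.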