representation: Nonlinear Function
Created: January 27, 2022
Modified: February 11, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • In modern ML, representation learning is the art of trying to find useful abstractions, embodied as encoding networks.
  • We can learn representations using a supervised objective, an unsupervised or self-supervised objective, or conceivably an RL / decision-based objective.
  • How do we know if a representation is good? Typically, the reason to do representation learning is to support transfer learning: to use the representation as a source of features for a new task.
    • Given a few examples for a new task, is it better to use an existing representation, or to fine-tune the entire network from end to end?
      • End-to-end tuning obviously offers the most flexibility.
      • But fixing the existing representation is a form of regularization. It allows you to learn from a small additional dataset without overfitting. A 'soft' version of this would be to use a much lower learning rate for the parts of the network that generate the existing representation (see the sketch after this list).
      • From a systems design standpoint, using a fixed representation also allows for modularity of components.
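
A concrete sketch of the two options (assuming a PyTorch setup; the module sizes, names, and learning rates below are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy stand-ins: a "pretrained" encoder and a fresh head for the new task.
# (Sizes and learning rates are illustrative, not from the notes.)
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
head = nn.Linear(16, 2)

FREEZE_ENCODER = True  # hard version: treat the representation as fixed

if FREEZE_ENCODER:
    for p in encoder.parameters():
        p.requires_grad = False          # regularize by not touching the encoder
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
else:
    # 'soft' version: fine-tune end to end, but give the pretrained encoder
    # a much smaller learning rate than the new head.
    optimizer = torch.optim.Adam([
        {"params": encoder.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ])
```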

Unsupervised representation learning

(warning: the following thoughts are rambling and confused)

One possible notion comes from thinking about lossless compression. We know from the Kraft inequality that any distribution $p(x_i)$ has an optimal encoding in which each state $x_i$ is represented using $-\log_2 p(x_i)$ bits, so that the expected description length is the entropy of the system.
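
A quick numerical sanity check (a sketch with a made-up distribution; the probabilities are dyadic so the rounded code lengths are exact and the expected length equals the entropy):

```python
import math

# Toy nonuniform distribution over four states (values made up, chosen dyadic).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon code lengths: ceil(-log2 p(x)) bits per state (Kraft-satisfiable).
lengths = {x: math.ceil(-math.log2(px)) for x, px in p.items()}

entropy = -sum(px * math.log2(px) for px in p.values())
expected_len = sum(px * lengths[x] for x, px in p.items())

print(lengths)                    # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(entropy, expected_len)      # 1.75 1.75  (they coincide for dyadic p)
```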

This is meaningful if we have a nonuniform stationary distribution. For example, if our system is a chamber containing a bouncing rubber ball, then states in which the ball continues to be a ball are much more likely than states in which its atoms are rearranged into a bird, so we can probably represent the system pretty well by just giving the position and momentum of the ball.

We could partition the system states into macrostates along the lines of 'ball is in position A', …, 'ball is in position Z', and a final catchall 'ball has exploded/is a bird/etc.'. Just giving the macrostate $y = f(x)$ corresponds to a lossy compression, where to identify a specific microstate we then need to specify an additional $H(x \mid y)$ bits (the conditional entropy). Say that the conditional entropy of all but the final catchall macrostate is close to zero: by specifying the position of the intact ball we've picked out a unique microstate. Meanwhile the conditional entropy of the final macrostate is giant.
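
A tiny numerical version of this decomposition (made-up probabilities: four 'intact ball' macrostates, each a singleton, plus a 16-microstate catchall):

```python
import math
from collections import defaultdict

# Toy system (numbers made up): four "intact ball" microstates, each its own
# macrostate, plus a catchall macrostate containing 16 equally likely states.
p = {}
macro = {}
for pos in ["A", "B", "C", "D"]:
    p[("ball", pos)] = 0.24
    macro[("ball", pos)] = f"ball at {pos}"
for i in range(16):
    p[("exploded", i)] = 0.04 / 16
    macro[("exploded", i)] = "catchall"

def H(dist):
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

# Marginal distribution over macrostates y = f(x).
p_y = defaultdict(float)
for x, px in p.items():
    p_y[macro[x]] += px

# Conditional entropy H(x|y) = sum_y p(y) * H(x | y).
H_x_given_y = 0.0
for y, py in p_y.items():
    cond = {x: px / py for x, px in p.items() if macro[x] == y}
    H_x_given_y += py * H(cond)

print(round(H(p_y), 3), round(H_x_given_y, 3), round(H(p), 3))
# H(y) + H(x|y) matches H(x) (chain rule, since y is a function of x);
# all of H(x|y) here comes from the catchall macrostate.
```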

I want to say that this partition is 'objectively good' in some sense, that it does carve the world at the joints. What's the right way to specify this?

I suspect the correct notion of expected description length is $H_p(y) + \mathbb{E}[H_{\mathcal{U}}(x \mid y)]$: using the optimal code on macrostates, we conditionally code microstates according to the uniform distribution, so $H_{\mathcal{U}}(x \mid y)$ is just $\log_2$ of the number of microstates in macrostate $y$. This incentivizes us to put any 'structure' into the macrostates.
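
A sketch of what this objective rewards, on a made-up distribution, comparing a trivial partition, a 'carve at the joints' partition, and the finest partition (the group names are mine):

```python
import math

# Toy distribution over 8 microstates (made-up numbers): two likely states and
# six rare ones.
p = {"s0": 0.45, "s1": 0.35, **{f"s{i}": 0.2 / 6 for i in range(2, 8)}}

def cost(partition):
    """H_p(y) + E_p[log2 |X_y|]: optimal code on macrostates, then a uniform
    (max-entropy) code on the microstate within its macrostate."""
    p_y = {y: sum(p[x] for x in xs) for y, xs in partition.items()}
    H_y = -sum(q * math.log2(q) for q in p_y.values() if q > 0)
    E_cond = sum(p_y[y] * math.log2(len(xs)) for y, xs in partition.items())
    return H_y + E_cond

coarse = {"all": list(p)}                                    # one macrostate
joints = {"likely": ["s0", "s1"], "rare": [f"s{i}" for i in range(2, 8)]}
finest = {x: [x] for x in p}                                 # every state alone

H_x = -sum(q * math.log2(q) for q in p.values())
for name, part in [("coarse", coarse), ("joints", joints), ("finest", finest)]:
    print(name, round(cost(part), 3))
print("H(x) =", round(H_x, 3))
# coarse costs log2(8) = 3.0 bits; 'joints' costs ~2.04; the finest partition
# attains H(x) ~2.03 exactly -- which is why the next paragraph asks for some
# capacity control on the macrostates.
```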

But then why not put everything into the macrostates? There needs to be some capacity control: how much do we really care about having a succinct description of the system?

For any upper limit $B$ on the allowed entropy of the macrostate distribution, enforcing the constraint $H_p(y) \le B < H_p(x)$, we'll be forced to put some structure into the microstates. And then we're incentivized to put the 'most uniform' structure there.
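
A brute-force sketch of the constrained version (made-up distribution and budget $B$): enumerate every set partition of a six-state space and keep the cheapest one whose macrostate entropy fits the budget.

```python
import math
from itertools import product

# Toy microstate distribution (made-up numbers).
p = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.1, "e": 0.05, "f": 0.05}
states = list(p)

def H(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

def macro_entropy(groups):
    return H([sum(p[x] for x in g) for g in groups])

def cost(groups):
    # H_p(y) + E_p[log2 |X_y|]: optimal macrostate code + uniform conditional code
    return macro_entropy(groups) + sum(
        sum(p[x] for x in g) * math.log2(len(g)) for g in groups)

def partitions(xs):
    # Enumerate set partitions by deduplicating group labelings (fine for 6 states).
    seen = set()
    for labels in product(range(len(xs)), repeat=len(xs)):
        groups = {}
        for x, lab in zip(xs, labels):
            groups.setdefault(lab, []).append(x)
        key = frozenset(frozenset(g) for g in groups.values())
        if key not in seen:
            seen.add(key)
            yield list(groups.values())

B = 1.0  # entropy budget: only consider partitions with H_p(y) <= B < H_p(x)
feasible = [g for g in partitions(states) if macro_entropy(g) <= B]
best = min(feasible, key=cost)
print(best, round(cost(best), 3), "vs H(x) =", round(H(list(p.values())), 3))
```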

I think the way to think of it is: suppose I'm going to transmit to you multiple world (micro)states. The way it's going to work is that I'll first send you a code that represents a distribution on macrostates. I won't bother sending a code for microstates-given-macrostates. Instead, I'll just transmit the series of macrostate representations, and then code the 'leftover' information in the microstate using the most naive possible representation, which is the maximum-entropy (uniform) distribution conditioned on the macrostate.
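
Making the per-sample cost of this scheme explicit (my own bookkeeping, writing $|X_y|$ for the number of microstates in macrostate $y$): transmitting $n$ microstates costs roughly

$$
n \Big( H_p(y) + \mathbb{E}_{p(y)}\big[\log_2 |X_y|\big] \Big) \;+\; \text{(one-time cost of describing the macrostate code)}
$$

bits, so per sample we're back to the objective above.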

Is this equivalent? Say I come up with the optimal prefix code given the distribution on microstates. There is some minimum codeword length, roughly $-\log_2$ of the probability of the most-probable states. The prefix of this length contains the 'high-order bits', and anything beyond it corresponds to 'low-order bits'. Now in general you'd want to partition according to the high-order bits.
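
To make this concrete, here's a sketch (made-up distribution): build a Huffman code, then partition states by the leading bits of their codewords.

```python
import heapq
import itertools

# Toy microstate distribution (made-up numbers).
p = {"a": 0.4, "b": 0.2, "c": 0.15, "d": 0.15, "e": 0.05, "f": 0.05}

def huffman(probs):
    """Return a prefix code {state: bitstring} for the given distribution."""
    counter = itertools.count()   # tie-breaker so heapq never compares dicts
    heap = [(q, next(counter), {x: ""}) for x, q in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        q0, _, code0 = heapq.heappop(heap)
        q1, _, code1 = heapq.heappop(heap)
        merged = {x: "0" + c for x, c in code0.items()}
        merged.update({x: "1" + c for x, c in code1.items()})
        heapq.heappush(heap, (q0 + q1, next(counter), merged))
    return heap[0][2]

code = huffman(p)
print(code)

# Partition microstates by the first k 'high-order' bits of their codewords:
# high-probability states (short codewords) end up in their own groups, while
# rare states get lumped together under a shared prefix.
k = 2
macrostates = {}
for x, bits in code.items():
    macrostates.setdefault(bits[:k], []).append(x)
print(macrostates)
```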

I think this gives a story for choosing a partition based only on minimum description length principles, without any other objective. (though in practice we may have other objectives)