
sparse coding

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

paper 1: http://redwood.berkeley.edu/bruno/papers/VR.pdf

basic idea: find a basis such that any given image (or whatever signal) can be reconstructed accurately by adding together just a few of the basis vectors.

step 1: find the basis.
step 2: for any given image, find its sparse representation.

typically a sparse basis is overcomplete: the number of basis vectors is larger than the dimensionality of the space, so the basis vectors are not linearly independent and a given image can have many possible encodings. This is what makes step 2 nontrivial: with a complete, linearly independent basis the coefficients would be unique, and we could just solve a linear system (or, for an orthonormal basis, project the image onto each basis vector) to get them.
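
a tiny numpy illustration of why step 2 is nontrivial (the random D and x here are made up for illustration): with an orthonormal basis the coefficients come straight from inner products, but an overcomplete dictionary has a nontrivial null space, so infinitely many codes reconstruct the same signal and we need an extra criterion -- sparsity -- to pick one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Complete orthonormal basis: coefficients are unique, recovered by projection.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
x = rng.standard_normal(4)
coeffs = Q.T @ x                      # project onto each basis vector
assert np.allclose(Q @ coeffs, x)     # exact, unique reconstruction

# Overcomplete dictionary: 8 atoms in a 4-d space.
D = rng.standard_normal((4, 8))
a1 = np.linalg.lstsq(D, x, rcond=None)[0]   # minimum-norm solution
null = np.linalg.svd(D)[2][4:].T            # basis for the null space of D
a2 = a1 + null @ rng.standard_normal(4)     # a different code for the same x
assert np.allclose(D @ a1, x) and np.allclose(D @ a2, x)
# Both a1 and a2 reconstruct x exactly; sparse coding picks a code with
# (approximately) as few nonzero coefficients as possible.
```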

why an overcomplete representation? think of the generative process that created the image. Suppose you have a 1000x1000 pixel image, so your image space is million-dimensional. But no one believes that the image is generated by exactly a million different, independent causes, each present to some degree. Instead there are far more than a million possible causes for the pixels in your image -- e.g., the image might contain a chair, or a parakeet, or a bowl of fruit, or any of many other objects, in many possible locations -- but in any given image, only a tiny subset (say, ten) of these causes are actually at work. So an overcomplete basis, in which any given natural image can be represented by a sparse set of basis vectors, seems like a good model for natural images.[1]

[1] Note that sparse coding applied to natural images tends to find lower-level image features, like edges or textural motifs, rather than objects like chairs or parakeets. This is because we model the images as a linear superposition of basis vectors, whereas the superposition of objects in an image is nonlinear -- if I have a parakeet sitting on a chair, the image I observe is not the sum of the chair pixels and the parakeet pixels, because the parakeet will occlude the chair -- and the object appearance also depends on many other factors, like position, orientation, and scale, which are not modeled in this approach. The basic intuition is the same, though.

Olshausen and Field formulate this as a generative model in which an image is modeled as a linear superposition of basis elements (these basis elements are also called a dictionary), plus i.i.d. Gaussian noise, with a flat prior on the dictionary entries, and a sparsity-inducing prior on the coefficients. They use a Cauchy distribution but note that a Laplacian also works (TODO: more on why these priors are sparsity-inducing, which ones are now more popular, relation to the lasso, etc). They then recover the maximum-likelihood set of basis elements jointly with coefficients for each image.
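
as a sketch (not the authors' code), the negative log posterior and a crude gradient-based coefficient inference under these assumptions might look like the following; `lam`, `lr`, and the iteration count are made-up placeholders, and the Gaussian noise variance is absorbed into `lam`.

```python
import numpy as np

def energy(x, D, a, lam=0.1):
    """Negative log posterior (up to constants) of coefficients `a` for a patch `x`
    under the model x = D a + Gaussian noise, with a Cauchy prior on each a_i.
    The noise variance is absorbed into the made-up weight `lam`."""
    recon = 0.5 * np.sum((x - D @ a) ** 2)
    sparsity = np.sum(np.log1p(a ** 2))   # Cauchy log-prior; np.abs(a) would give a Laplacian
    return recon + lam * sparsity

def infer_coeffs(x, D, lam=0.1, lr=0.01, steps=500):
    """MAP inference of the coefficients by plain gradient descent
    (a sketch; step size and iteration count are arbitrary)."""
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = -D.T @ (x - D @ a) + lam * 2 * a / (1 + a ** 2)
        a -= lr * grad
    return a
```

Learning the dictionary then roughly alternates with inference: infer coefficients for a batch of patches, take a gradient step on D to reduce reconstruction error, and rescale the basis vectors to keep their norms bounded.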

What do they find? TODO: show images and discuss the results.

Is this actually useful? TODO: follow up on applications to vision -- are these actually useful features for any given task? (probably yes: see refs from intro of LISTA paper)

TODO: connection to ICA, Lasso, etc.

problem: to get sparse-coded features for natural images, we have to solve an iterative optimization problem for every patch of the image.

paper 2: ISTA? something intermediate.

how do we efficiently solve this optimization problem?

introduce the idea of a 'shrinkage function': a soft-thresholding nonlinearity h_theta(v) = sign(v) * max(|v| - theta, 0) that pulls every coefficient toward zero and sets the small ones exactly to zero, which is what makes the iterates sparse.

there is a 'mutual inhibition matrix' S = I - (1/L) D^T D (D the dictionary, L an upper bound on the largest eigenvalue of D^T D), so that each ISTA step is z <- h_theta(W_e x + S z), with 'filter' matrix W_e = (1/L) D^T and threshold theta = alpha/L. The off-diagonal entries of S make the coefficients compete: an active coefficient inhibits the others in proportion to how correlated their dictionary elements are with its own.
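
a minimal ISTA sketch for the lasso-style objective 0.5 * ||x - D z||^2 + alpha * ||z||_1 (alpha and the iteration count below are placeholder values):

```python
import numpy as np

def soft_threshold(v, theta):
    """Shrinkage function: pulls entries toward zero, exactly zeroing small ones."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(x, D, alpha=0.1, n_iters=100):
    """ISTA for min_z 0.5 * ||x - D z||^2 + alpha * ||z||_1 (a sketch)."""
    L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the gradient
    W_e = D.T / L                              # 'filter' matrix
    S = np.eye(D.shape[1]) - D.T @ D / L       # 'mutual inhibition' matrix
    z = np.zeros(D.shape[1])
    for _ in range(n_iters):
        z = soft_threshold(W_e @ x + S @ z, alpha / L)
    return z
```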

paper 3: http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf

what if we don't solve the problem at all, and instead just train a neural net to give us the answers? (at the same time, find a sparse code that can be well-predicted by a neural net...)

one big idea: any iterative update rule can be 'unrolled' into a neural net.

but, if you're just going to run your 'update rule' for a few steps, you don't necessarily want the same updates as if you were running it to convergence. so the next idea is: you can train the specific updates (including the 'mutual inhibition' and 'filter' terms as well as the shrinkage thresholds) to get the best performance for whatever fixed number of updates you have the computational budget to run.
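
a sketch of the unrolled network (LISTA-style), using PyTorch so the unrolled updates are trainable; the sizes, initialization, and training target below are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

def shrink(v, theta):
    """Learned shrinkage: soft-thresholding with per-unit thresholds."""
    return torch.sign(v) * torch.clamp(v.abs() - theta, min=0.0)

class LISTA(nn.Module):
    """ISTA unrolled for a fixed number of steps, with the 'filter' matrix W_e,
    the 'mutual inhibition' matrix S, and the thresholds theta all learned
    (rather than computed from the dictionary as in plain ISTA)."""

    def __init__(self, input_dim, code_dim, n_steps=3):
        super().__init__()
        self.n_steps = n_steps
        self.W_e = nn.Parameter(0.01 * torch.randn(code_dim, input_dim))
        self.S = nn.Parameter(0.01 * torch.randn(code_dim, code_dim))
        self.theta = nn.Parameter(torch.full((code_dim,), 0.1))

    def forward(self, x):                  # x: (batch, input_dim)
        b = x @ self.W_e.T                 # filter term, computed once per input
        z = shrink(b, self.theta)
        for _ in range(self.n_steps - 1):  # each loop iteration is one unrolled step
            z = shrink(b + z @ self.S.T, self.theta)
        return z

# Training sketch: regress onto codes z_star obtained by running ISTA to
# convergence with a fixed, pre-learned dictionary, e.g.
#   model = LISTA(input_dim=256, code_dim=1024)
#   loss = ((model(x_batch) - z_star) ** 2).mean(); loss.backward()
```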

-- they also try learning a nonlinearity (shrinkage function), but end up just finding that a piecewise linear function that zeroes out near the origin (same function as used by ISTA) is as good as anything they learn.

there's another idea, of trying to learn a dictionary specifically designed to be well-predicted by your reconstruction network. that's not in this paper though (they learn dictionaries as a separate step before trying to recover sparse codes). where is it from?