structured prediction
Created: May 29, 2020
Modified: March 03, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

In kindergarten stats, you learn how to build a model that takes in data (a feature vector, image, sound file, etc) and predicts a single continuous value (regression) or a discrete value (classification). It's not that hard to extend this machine to predict multiple continuous values (points in $\mathbb{R}^d$), or probabilities of multiple discrete outcomes.

Structured prediction comes in when we want our model's range of outputs to be a combinatorially large space; for example, if we want our model to produce sentences, parse trees, graphs, image segmentations, part-of-speech labels, etc.

We can of course specify a model to directly generate output structures by some sequential process; for example, we can sample sentences directly from an autoregressive model. But it often makes sense instead to train a model to produce a score for a structure, and then run some search or sampling algorithm to find the structure with the best score. (Why? Essentially because this converts a generative modeling task (producing a sentence) into a discriminative task (scoring a sentence). By offloading much of the work to the search algorithm, it reduces the load on the learning system. And good search algorithms are generally much better than whatever your model could learn on its own, though this may not be true forever.)

  • Examples:
    • The prototypical example involves learning potentials of an undirected graphical model. The 'structure' is the values at the nodes. The prediction is the lowest-energy configuration. In nice cases like trees we can sometimes find this configuration efficiently.
    • In AlphaFold, the model produces an energy function (aka score) of a configuration, and gradient descent is used to search for the best configuration.

Our formal setup will assume a score function $s(x, y; w)$ which assigns a real-valued score to an $(x, y)$ pair according to weights $w$. Given a new input $x^*$, we generally want to predict the highest-scoring structure:

$$y^* = \arg\max s(x^*, \cdot; w)$$

This is a nontrivial problem in general, since it involves searching over a combinatorially large space of structures. Doing this efficiently typically requires problem-specific algorithms. For now let's assume that we have this capability, and consider how to train the model, i.e., how to choose the weights $w$.

Classic approach: structured perceptron

The classical setup (e.g., from pystruct) assumes that the score is defined to be a linear combination of fixed 'feature functions' $f_i(x, y)$ defined on joint input/output pairs:

$$s(x, y; w) = \sum_i w_i^T f_i(x, y).$$

Intuitively the features $f_i$ should capture the 'compatibility' of the input/output pair along with the 'coherence' of the output on its own.
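
To make this concrete, here is a toy sketch (my own illustration, not taken from pystruct) of joint feature functions for a part-of-speech-tagging-style problem: emission features capture word/tag compatibility, and transition features capture the coherence of the tag sequence on its own.

```python
# Toy joint feature functions for a tagging problem (illustrative names only).
from collections import Counter

def features(words, tags):
    """Map an (input sentence, tag sequence) pair to a sparse feature vector."""
    f = Counter()
    for word, tag in zip(words, tags):
        f[("emit", word.lower(), tag)] += 1.0      # input/output compatibility
    for prev_tag, tag in zip(tags, tags[1:]):
        f[("trans", prev_tag, tag)] += 1.0         # coherence of adjacent tags
    return f

def score(w, words, tags):
    """Linear score s(x, y; w) = sum_i w_i f_i(x, y), with sparse weights w."""
    return sum(w.get(k, 0.0) * v for k, v in features(words, tags).items())
```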

A simple training algorithm in this setting is the structured perceptron: given a training example $(x_k, y_k)$, we find the $y_k^* = \arg\max s(x_k, \cdot; w)$ that maximizes the score. If $y_k \ne y_k^*$, we increment the weights by the difference between features,

$$w \mathrel{+}= f(x_k, y_k) - f(x_k, y_k^*)$$

so that the true structure becomes 'better' and the wrong prediction becomes worse. Otherwise we do nothing.
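
Here is a minimal sketch of that training loop, assuming a problem-specific decoder `argmax_decode(w, x)` (hypothetical, not defined here) that returns the highest-scoring structure under the current weights, and a sparse `features(x, y)` function like the one sketched above.

```python
# Structured perceptron training loop. The decoder that solves
# argmax_y s(x, y; w) is assumed to exist; it is not defined here.
from collections import Counter

def structured_perceptron(data, argmax_decode, features, n_epochs=10):
    """data: iterable of (x, y_true) pairs; returns learned sparse weights."""
    w = Counter()
    for _ in range(n_epochs):
        for x, y_true in data:
            y_hat = argmax_decode(w, x)            # current best-scoring structure
            if y_hat != y_true:
                # Move weights toward the true structure's features and away
                # from the (wrongly) predicted structure's features.
                for k, v in features(x, y_true).items():
                    w[k] += v
                for k, v in features(x, y_hat).items():
                    w[k] -= v
    return w
```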

Probabilistic approach: maximum likelihood

More generally, we can consider training under a maximum likelihood objective. Given weights $w$ and a pair $(x, y)$, define the (conditional) likelihood of the structure $y$ as

$$\log p(y \mid x; w) = s(x, y; w) - \log Z(x; w),$$

where $Z(x; w) = \sum_{y'} e^{s(x, y'; w)}$ is a normalizing constant ensuring that the probabilities sum to 1. We can then try to find the $w$ that maximizes the joint likelihood of all $N$ structures in our training set:

$$\log p(Y \mid X; w) = \sum_{k=1}^N \log p(y_k \mid x_k; w)$$

Advantages of this formulation include:

  • We can use the optimization algorithm of our choice: stochastic gradient, batch gradient, Newton's method, etc.
  • We can easily extend the setup by adding additional terms to the loss.
  • The score function $s$ is no longer restricted to be linear-in-features; it can in principle be an arbitrary differentiable function (such as a deep network), allowing us to learn features by end-to-end optimization.

Note that this formulation replaces the need to predict the optimal $y^*$, from the structured perceptron setup above, with the need to sum over all possible structures in order to evaluate the normalizing constant $Z(x; w)$. Doing either of these efficiently typically requires us to exploit problem-specific structure.
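
As a toy illustration of the probabilistic setup, here is a sketch (the tiny MLP score and all names are my own placeholders) for the special case where the output space is small enough to enumerate, so $\log Z(x; w)$ can be computed by brute force and the whole objective trained with autodiff.

```python
# Maximum-likelihood training with a brute-force log normalizer, for a toy
# problem where `all_ys` enumerates every possible structure (as vectors).
import jax
import jax.numpy as jnp

def score(w, x, y):
    # A small MLP over the concatenated (x, y) representation; stands in for
    # any differentiable score function.
    h = jnp.tanh(w["W1"] @ jnp.concatenate([x, y]) + w["b1"])
    return jnp.dot(w["W2"], h)

def log_likelihood(w, x, y, all_ys):
    scores = jnp.stack([score(w, x, y_prime) for y_prime in all_ys])
    log_Z = jax.scipy.special.logsumexp(scores)    # exact, since we enumerate
    return score(w, x, y) - log_Z                  # log p(y | x; w)

def sgd_step(w, x, y, all_ys, lr=0.1):
    grads = jax.grad(lambda w_: -log_likelihood(w_, x, y, all_ys))(w)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, w, grads)
```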

Example: conditional random fields

In the standard CRF setting, the $f_i$ are graphical model potentials arising as exponential family sufficient statistics, so the whole apparatus is an exponential family, and $Z$ can often be tractably computed by message passing.
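
For the linear-chain case, that message-passing computation is just the forward algorithm. Here is a sketch (array shapes and names are my own convention), with a brute-force check on a tiny chain.

```python
# Forward-algorithm computation of log Z for a linear-chain model, given
# per-position emission scores and a tag-transition score matrix.
import numpy as np
from scipy.special import logsumexp

def linear_chain_log_Z(emissions, transitions):
    """emissions: (T, K), emissions[t, k] = score of tag k at position t;
    transitions: (K, K), transitions[i, j] = score of tag i followed by tag j."""
    alpha = emissions[0]                      # log-scores over length-1 prefixes
    for t in range(1, emissions.shape[0]):
        # alpha'[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return logsumexp(alpha)                   # marginalize out the final tag

# Sanity check against explicit enumeration of all K^T tag sequences.
T, K = 3, 2
rng = np.random.default_rng(0)
em, tr = rng.normal(size=(T, K)), rng.normal(size=(K, K))
brute = logsumexp([em[np.arange(T), list(ys)].sum() + tr[ys[:-1], ys[1:]].sum()
                   for ys in np.ndindex(*([K] * T))])
assert np.isclose(linear_chain_log_Z(em, tr), brute)
```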

There may be values of $w$ for which the distribution doesn't normalize: the domain of an exponential family's canonical parameters is not always unrestricted (for example, Gaussian precision matrices must be positive definite). For such invalid $w$'s we will effectively have $\log Z(x; w) = \infty$, so the log-likelihood is infinitely low; thus optimization will 'do the right thing' to ensure that we find a normalized distribution.

Approximating the log normalizer

Even if we can't compute $Z$ explicitly, we can sometimes get a stochastic approximation to the gradient of $\log Z$. Recall that the gradient of the log normalizer is just the expected sufficient statistics (the mean parameter):

$$\nabla_w \log Z(x; w) = E_{y \sim p(y \mid x; w)}[\mathbf{f}(x, y)]$$

So if we could draw samples from $p(y \mid x; w)$, we would have a stochastic estimate of the gradient.
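
To spell out the identity above in the linear-in-features case, where $\nabla_w s(x, y; w) = \mathbf{f}(x, y)$:

$$\nabla_w \log Z(x; w) = \frac{1}{Z(x; w)} \sum_{y'} e^{s(x, y'; w)}\, \nabla_w s(x, y'; w) = \sum_{y'} p(y' \mid x; w)\, \mathbf{f}(x, y') = E_{y \sim p(y \mid x; w)}[\mathbf{f}(x, y)].$$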

If we can't do this explicitly, we could perhaps use an approximate sampling method like MCMC. This is potentially expensive, but contrastive divergence points out that we don't actually need the MCMC chain to converge; we just need something that pulls us in the right direction. So we can initialize at the true label $y$, and then run just one or two MCMC steps to get a (biased) approximate gradient.

Q: Why do we need any MCMC steps at all? Why can't we just use the empirical distribution on structures $y \mid x$ as a proxy for the model's distribution?

A: Because we're already doing that in the first term. Writing the data-fit term of the full-data log-likelihood as $\frac{1}{N} \sum_{j=1}^N \sum_i w_i^T f_i(x^{(j)}, y^{(j)})$, we see that this is an expectation over the data distribution, and its gradient is just $E_{x, y \sim D}\left[\sum_i f_i(x, y)\right]$. Using the empirical distribution as a proxy for the $Z$ term would give exactly the same result, so the two terms would cancel and the gradient would always be zero. Running one or two MCMC steps instead gives you something that should tend to point noisily in the direction of the true gradient, although you'd need infinitely many steps to remove the bias.
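
A sketch of the resulting one-step (CD-1) gradient estimate for the linear-in-features model, assuming a problem-specific `mcmc_step(w, x, y)` routine (hypothetical, not defined here) that performs a single MCMC transition targeting $p(y \mid x; w)$:

```python
# CD-1 gradient estimate: positive (data) term minus a negative term computed
# from one MCMC step initialized at the true label.
from collections import Counter

def cd1_gradient(w, x, y_true, features, mcmc_step):
    """Noisy, biased estimate of grad_w log p(y_true | x; w)."""
    y_neg = mcmc_step(w, x, y_true)          # chain initialized at the true label
    grad = Counter(features(x, y_true))      # 'positive' phase: data features
    for k, v in features(x, y_neg).items():
        grad[k] -= v                         # 'negative' phase: approximate model sample
    return grad                              # exact only in the limit of many steps
```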

Avoiding normalization by predicting structural features

AlphaFold uses a trick: instead of trying to define a normalized likelihood over protein structures, it defines a normalized likelihood over features of protein structures. Because we get to choose the feature representation, we can avoid the need to compute or approximate a tricky log-normalizer.

Concretely, AlphaFold computes some features $g_i(y)$ of each protein structure: these are things like residue distances and bond angles. Then it trains a model to jointly predict those features from the input sequence $x$. This is standard supervised learning: we choose a parameterized likelihood model for $\log p(\mathbf{g}(y) \mid x; w)$, such as a deep network, and train it by stochastic gradient. Such a model will be normalized by construction: for example, it might output the mean and variance of individual Gaussian distributions for each residue pair.
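
A sketch of what that supervised training step could look like, with a placeholder `predict(w, x)` network (my own stand-in, not AlphaFold's actual architecture) that emits a Gaussian mean and log-stddev for each feature of the structure:

```python
# Training a feature likelihood log p(g(y) | x; w) that is normalized by
# construction: independent Gaussians over the observed structural features.
import jax
import jax.numpy as jnp

def gaussian_logpdf(z, mean, log_std):
    return (-0.5 * ((z - mean) / jnp.exp(log_std)) ** 2
            - log_std - 0.5 * jnp.log(2.0 * jnp.pi))

def feature_log_likelihood(w, predict, x, observed_features):
    """observed_features = g(y) for the true structure paired with input x."""
    mean, log_std = predict(w, x)            # per-feature predictive Gaussians
    return jnp.sum(gaussian_logpdf(observed_features, mean, log_std))

def train_step(w, predict, x, observed_features, lr=1e-3):
    loss = lambda w_: -feature_log_likelihood(w_, predict, x, observed_features)
    grads = jax.grad(loss)(w)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, w, grads)
```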

The 'score' of a candidate structure at test time is just the log-likelihood of its features:

$$s(x, y; w) = \log p(\mathbf{g}(y) \mid x; w)$$

If everything is differentiable, we can attempt to predict the highest-scoring structure by simple gradient-based optimization.
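
Here is a sketch of that test-time search, reusing the hypothetical `predict(w, x)` network from above together with a simple pairwise-distance feature map (again a stand-in, not AlphaFold's actual pipeline): the candidate coordinates are the optimization variable, and we ascend the score by gradient steps.

```python
# Gradient-based structure search: maximize s(x, y; w) = log p(g(y) | x; w)
# over candidate coordinates y, with g(y) = pairwise distances.
import jax
import jax.numpy as jnp

def pairwise_distances(coords):
    """Differentiable feature map g(y): all pairwise distances between residues."""
    diffs = coords[:, None, :] - coords[None, :, :]
    return jnp.sqrt(jnp.sum(diffs ** 2, axis=-1) + 1e-8)

def structure_score(coords, w, predict, x):
    """Feature log-likelihood, dropping terms constant in the coordinates."""
    mean, log_std = predict(w, x)
    z = pairwise_distances(coords)
    return jnp.sum(-0.5 * ((z - mean) / jnp.exp(log_std)) ** 2)

def predict_structure(w, predict, x, init_coords, steps=1000, lr=1e-2):
    score_grad = jax.grad(structure_score)   # gradient w.r.t. the structure itself
    coords = init_coords
    for _ in range(steps):
        coords = coords + lr * score_grad(coords, w, predict, x)
    return coords
```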

It's not obvious that these scores are coherent, i.e., that they normalize when summed over structures (as opposed to features). It turns out that they are, as long as the feature representation is injective, meaning that every structure has a unique feature representation. (To avoid measure-theoretic details, consider the case where both the structures $y$ and the features $\phi = g(y)$ are discrete. In general, the structure likelihoods will sum to $\sum_y p(g(y) \mid x; w) = \sum_\phi N(\phi)\, p(\phi \mid x; w)$, where $N(\phi)$ counts the structures mapping to feature value $\phi$; if $g$ is injective then $N(\phi) \le 1$, so the total is at most 1 and the scores define a valid (sub-)normalized distribution over structures.)

Q: How does this connect to the 'traditional' structured learning setup above?

A: It's a special case, in that our features are just of $y$, but also a slight generalization. If we view the features as one-hot distance bins, then our energy is actually roughly a linear combination of these bins, but with weights that depend on $x$ because they're output by the neural net. Conceptually that's fine.

Q1: What enables this supervised training procedure? It's that our weights are a function of $x$, and our overall energy function has the form of a conditional log-density on those features. That is, it's normalized over $y$'s features by definition (except for the steric strain term, which we don't learn). And this makes it normalized over $y$, because $y$'s features are overcomplete; they exactly determine $y$.

  • In the setting of something like part-of-speech tagging, we'd train classifiers to predict individual tags, or maybe pairs of tags. But single tags aren't useful because we're not capturing structure, and pairs of tags are a bit weird? It's not crazy though.

Q2: could we train AlphaFold by CD? This would involve sampling a (protein, structure) pair, taking one HMC step over the structure, and then taking the difference in gradients w.r.t. $\theta$ between the true and sampled structure as a noisy estimate of the gradient. It's a little silly because in this case we know that the normalization term is constant. If we use this knowledge to set its gradient to zero instead of the noisy estimate, then contrastive divergence reduces to just maximizing the likelihood of the true $y$, which is what's actually done.

Q3: can we learn the features $g$? For example, the number or boundaries of the distance bins? If we just stick these in with $\theta$ and jointly optimize, we'll be incentivized to learn trivial features. But if we can frame the model as a distribution on $y$ (for example, the maximum-entropy distribution given the features), then in principle this is a well-posed problem. We could certainly use CD or similar to solve it.

Q: what are the connections to energy-based models?

A: From Yann LeCun's survey, EBMs do not need to have a probabilistic interpretation. There is a data-fit term that 'pushes down' on the energy of correct outputs, and some other term that 'pushes up' on the energy of incorrect outputs. In the conditional probability framework, this second term is the log normalizer. But you could also define it as something like a margin: the distance to the nearest point at which the model makes an incorrect prediction.