Modified: July 15, 2022
penalties are constraints
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We often see optimization problems with objectives of the form

$$f(x) + \lambda g(x),$$
where $f$ is the main function of interest (e.g., training loss in machine learning) and $g$ is a nonnegative penalty or regularization term, e.g., $g(x) = \|x\|^2$, to encourage 'smaller' or 'simpler' values of $x$.
This is equivalent to a constrained optimization problem

$$\min_x\; f(x) \quad\text{subject to}\quad g(x) \le c$$
for some value of $c$. The basic intuition is that we can reconstruct the original objective from the Lagrangian of this constrained optimization problem. In particular, write

$$\mathcal{L}(x, \lambda) = f(x) + \lambda\left(g(x) - c\right).$$
The solution $x^*$ corresponds to some critical point of the Lagrangian. Suppose we knew the appropriate $\lambda^*$. Then finding $x^*$ is just a matter of finding a critical point of

$$f(x) + \lambda^* g(x),$$
where we elide the term $-\lambda^* c$ since it is constant with respect to $x$.
Do we know the critical point is a saddle point, i.e., are we minimizing with respect to $x$? The Karush-Kuhn-Tucker conditions tell us that if strong duality holds, then the primal-dual optimum $(x^*, \lambda^*)$ is indeed a saddle point of the Lagrangian.
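For reference, the saddle-point property in question can be written

$$\mathcal{L}(x^*, \lambda) \;\le\; \mathcal{L}(x^*, \lambda^*) \;\le\; \mathcal{L}(x, \lambda^*) \qquad \text{for all } x \text{ and all } \lambda \ge 0,$$

so in particular $x^*$ minimizes $\mathcal{L}(x, \lambda^*)$; the KKT conditions also give complementary slackness, $\lambda^*\left(g(x^*) - c\right) = 0$, i.e., at the optimum either the constraint is saturated or the penalty vanishes.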
Relationship between $\lambda$ and $c$
What is the optimal value $\lambda^*$? Before attempting to give a general answer, let's look at a simple example with a closed-form solution,

$$\min_x\; \|x - a\|^2 \quad\text{subject to}\quad \|x\|^2 \le c,$$
where we can construct the Lagrangian

$$\mathcal{L}(x, \lambda) = \|x - a\|^2 + \lambda\left(\|x\|^2 - c\right)$$
and solve for its critical points in $x$,

$$\nabla_x \mathcal{L} = 2(x - a) + 2\lambda x = 0 \;\implies\; x^*(\lambda) = \frac{a}{1 + \lambda}.$$
Plugging these in we reduce the Lagrangian to a function of $\lambda$, the dual objective

$$\mathcal{L}^*(\lambda) = \mathcal{L}\left(x^*(\lambda), \lambda\right) = \frac{\lambda}{1 + \lambda}\|a\|^2 - \lambda c,$$
which we can then differentiate and solve to yield the solution

$$\frac{d\mathcal{L}^*}{d\lambda} = \frac{\|a\|^2}{(1 + \lambda)^2} - c = 0 \;\implies\; \lambda^* = \frac{\|a\|}{\sqrt{c}} - 1$$

(taking the nonnegative root, which assumes $\|a\|^2 > c$ so that the constraint actually binds).
So for this particular problem, we find that the penalized objective with $\lambda = \|a\|/\sqrt{c} - 1$ has the same solution as the constrained objective with bound $c$. Conversely, the penalized objective with a given $\lambda$ would have the same solution as the constrained objective with bound $c = \|a\|^2 / (1 + \lambda)^2$, obtained by inverting the previous relation.
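As a sanity check, here is a small numeric sketch of this equivalence; the particular values of $a$ and $c$ are arbitrary choices of mine, and the scipy setup is just one way to do it:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem from above: f(x) = ||x - a||^2, g(x) = ||x||^2.
a = np.array([3.0, 4.0])  # so ||a|| = 5
c = 4.0                   # bound for g(x) <= c; active, since ||a||^2 > c

f = lambda x: np.sum((x - a) ** 2)
g = lambda x: np.sum(x ** 2)

# Closed-form relationship derived above: lambda* = ||a||/sqrt(c) - 1 = 1.5.
lam = np.linalg.norm(a) / np.sqrt(c) - 1.0

# Penalized problem: min_x f(x) + lambda* g(x).
x_pen = minimize(lambda x: f(x) + lam * g(x), x0=np.zeros(2)).x

# Constrained problem: min_x f(x) subject to g(x) <= c
# (scipy's "ineq" convention is fun(x) >= 0).
x_con = minimize(f, x0=np.zeros(2),
                 constraints=[{"type": "ineq", "fun": lambda x: c - g(x)}]).x

print(x_pen, x_con)  # both ~ a / (1 + lambda*) = [1.2, 1.6]
```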
General case
We can in general define the dual objective

$$\mathcal{L}^*(\lambda) = \mathcal{L}\left(x^*(\lambda), \lambda\right) = f\left(x^*(\lambda)\right) + \lambda\left(g\left(x^*(\lambda)\right) - c\right),$$
where $x^*(\lambda)$ is the location of the optimum for a given $\lambda$. Solving for the $\lambda$ where $\frac{d\mathcal{L}^*}{d\lambda}$ vanishes will give some relationship between $\lambda$ and $c$ for a particular problem. In general we have

$$\frac{d\mathcal{L}^*}{d\lambda} = g\left(x^*(\lambda)\right) - c + \left(\nabla f\left(x^*(\lambda)\right) + \lambda \nabla g\left(x^*(\lambda)\right)\right)^T J$$

for the Jacobian

$$J = \frac{\partial x^*(\lambda)}{\partial \lambda},$$
in which the Jacobian matrix $J$ can be worked out by implicit differentiation of the optimality condition that defines $x^*(\lambda)$, but actually turns out not to be needed! Why? Recall that at optimality we have parallel gradients $\nabla f(x^*) = -\lambda \nabla g(x^*)$; in particular, these continue to be parallel if we apply the same linear transformation (for example, the Jacobian $J$) to both sides. So these terms cancel, and we're left with

$$\frac{d\mathcal{L}^*}{d\lambda} = g\left(x^*(\lambda)\right) - c.$$
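This is the envelope theorem in action. A quick finite-difference sketch on the toy problem from the previous section (arbitrary numbers again, my own) confirms that the Jacobian term contributes nothing:

```python
import numpy as np

# Toy problem from above: f(x) = ||x - a||^2, g(x) = ||x||^2, with
# closed-form x*(lambda) = a / (1 + lambda).
a, c = np.array([3.0, 4.0]), 4.0
lam, eps = 1.0, 1e-6  # deliberately not lambda*, so the derivative is nonzero

def dual(lam):
    """L*(lambda) = L(x*(lambda), lambda)."""
    x = a / (1.0 + lam)
    return np.sum((x - a) ** 2) + lam * (np.sum(x ** 2) - c)

# Finite-difference derivative of the dual objective...
fd = (dual(lam + eps) - dual(lam - eps)) / (2 * eps)

# ...agrees with g(x*(lambda)) - c alone; no Jacobian term required.
x_star = a / (1.0 + lam)
print(fd, np.sum(x_star ** 2) - c)  # both ~ 2.25
```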
Setting this to zero, we see that if things are sufficiently 'nice', then $g\left(x^*(\lambda)\right) = c$, and correspondingly $\lambda = \lambda^*(c)$ and $c = c^*(\lambda)$, where:
- $\lambda^*(c)$ is the function that maps from the constraint bound $c$ to the $\lambda$ that saturates that constraint bound
- $c^*(\lambda)$ is the inverse of $\lambda^*$; it gives us the $c$ that would produce a specific value of $\lambda$

This doesn't seem super helpful? It's a very implicit definition.
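Implicit, but still computable: here is a sketch (reusing the toy problem from above, with arbitrary numbers of my own choosing) that recovers $\lambda^*(c)$ by root-finding the saturation condition $g\left(x^*(\lambda)\right) = c$:

```python
import numpy as np
from scipy.optimize import brentq

# Toy problem from above, where x*(lambda) = a / (1 + lambda) in closed form.
a = np.array([3.0, 4.0])  # so ||a|| = 5
c = 4.0

def constraint_gap(lam):
    x_star = a / (1.0 + lam)        # closed-form x*(lambda)
    return np.sum(x_star ** 2) - c  # g(x*(lambda)) - c

# Root-find the implicit definition g(x*(lambda)) = c over a bracket...
lam_numeric = brentq(constraint_gap, 0.0, 100.0)

# ...and compare against the closed form lambda* = ||a||/sqrt(c) - 1.
print(lam_numeric, np.linalg.norm(a) / np.sqrt(c) - 1.0)  # both 1.5
```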
I want to be able to say something like:
- the penalty $\lambda$ is a scaling term on the relative gradient of the objective and the constraint at the optimum (made precise below)
- the constraint bound $c$ is about the value of the constraint at the optimum
- so their relationship depends on the relative strengths of the constraint and the objective, at the point where the optimum is achieved
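One way to make the first point concrete (this is my gloss, but it follows directly from the stationarity condition used above): at an optimum with an active constraint we have $\nabla f(x^*) = -\lambda \nabla g(x^*)$, so as long as $\nabla g(x^*) \neq 0$, taking norms gives

$$\lambda = \frac{\left\|\nabla f(x^*)\right\|}{\left\|\nabla g(x^*)\right\|}.$$

That is, $\lambda$ measures how steep the objective is relative to the penalty at the solution, while $c = g(x^*)$ just reads off the penalty's value there; the mapping between the two depends entirely on where the optimum lands.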