Created: July 07, 2022
Modified: July 15, 2022

penalties are constraints

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We often see optimization problems with objectives of the form

$$L(x) = f(x) + \beta\phi(x)$$

where $f(x)$ is the main function of interest (e.g., training loss in machine learning) and $\phi(x)$ is a nonnegative penalty or regularization term, e.g. $\phi(x) = \|x\|^2$, to encourage 'smaller' or 'simpler' values of $x$.
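For concreteness, here's a minimal Python sketch of such an objective (the least-squares loss and the data shapes are illustrative choices, not from any particular source):

```python
import numpy as np

# Penalized objective L(x) = f(x) + beta * phi(x), with f a least-squares
# training loss and phi(x) = ||x||^2 a ridge-style penalty.
def f(x, A, b):
    return 0.5 * np.sum((A @ x - b) ** 2)

def phi(x):
    return np.sum(x ** 2)

def penalized_objective(x, A, b, beta):
    return f(x, A, b) + beta * phi(x)
```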

This is equivalent to a constrained optimization problem

$$\begin{align*} \min_x\ & f(x)\\ \text{s.t. } & \phi(x) \le \epsilon \end{align*}$$

for some value of $\epsilon$. The basic intuition is that we can reconstruct the original objective from the Lagrangian of this constrained optimization problem. In particular, write

$$\mathcal{L}(x, \lambda) = f(x) + \lambda \left(\phi(x) - \epsilon\right)$$

The solution corresponds to some critical point $x^*, \lambda^*$ of the Lagrangian. Suppose we knew the appropriate $\lambda^*$. Then finding $x^*$ is just a matter of finding a critical point of

$$\mathcal{L}(x, \lambda^*) = f(x) + \lambda^* \phi(x) - \lambda^*\epsilon$$

where we can drop the term $\lambda^*\epsilon$, since it is constant with respect to $x$. What remains is exactly the penalized objective, with $\beta = \lambda^*$.
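As a quick numeric sanity check of this equivalence (a sketch with toy data and tolerances of my own choosing): minimize the penalized objective for some fixed $\lambda$, read off the bound $\epsilon = \phi(x^*)$ that it implicitly enforces, and confirm that the constrained problem with that bound recovers the same solution.

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 3)), rng.normal(size=10)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)  # toy convex objective
phi = lambda x: np.sum(x ** 2)                # quadratic penalty

lam = 0.7  # arbitrary penalty strength
x_pen = minimize(lambda x: f(x) + lam * phi(x), np.zeros(3)).x

# The constraint bound that this lambda implicitly enforces:
eps = phi(x_pen)
x_con = minimize(f, np.zeros(3), method="trust-constr",
                 constraints=[NonlinearConstraint(phi, -np.inf, eps)]).x

print(np.allclose(x_pen, x_con, atol=1e-4))  # expect: True
```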

do we know the critical point is a saddle point? i.e., are we minimizing with respect to $x$? The Karush-Kuhn-Tucker conditions tell us that if strong duality holds, then $(x^*, \lambda^*)$ is indeed a saddle point: $x^*$ minimizes $\mathcal{L}(x, \lambda^*)$ over $x$, while $\lambda^*$ maximizes $\mathcal{L}(x^*, \lambda)$ over $\lambda \ge 0$.
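For reference, the KKT conditions for the constrained problem are

$$\begin{align*} \nabla f(x^*) + \lambda^* \nabla \phi(x^*) &= 0 && \text{(stationarity)}\\ \phi(x^*) &\le \epsilon && \text{(primal feasibility)}\\ \lambda^* &\ge 0 && \text{(dual feasibility)}\\ \lambda^*\left(\phi(x^*) - \epsilon\right) &= 0 && \text{(complementary slackness)}. \end{align*}$$

Complementary slackness says the penalty is active ($\lambda^* > 0$) only when the constraint is saturated, $\phi(x^*) = \epsilon$; that saturated regime is exactly where the penalized and constrained problems correspond nontrivially.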

Relationship between $\lambda$ and $\epsilon$

What is the optimal value $\lambda^*$? Before attempting to give a general answer, let's look at a simple example with a closed-form solution,

$$\begin{align*}\min_{x}\ f(x) &= x_1 + x_2\\ \text{s.t. } \phi(x) &= x_1^2 + x_2^2 = \epsilon\end{align*}$$

where we can construct the Lagrangian

$$\mathcal{L}(x, \lambda) = x_1 + x_2 + \lambda\left(x_1^2 + x_2^2 - \epsilon\right)$$

and solve for its critical points in xx,

$$\begin{align*} \frac{\partial\mathcal{L}}{\partial x_1} = 0 &\implies x_1^* = -\frac{1}{2\lambda}\\ \frac{\partial\mathcal{L}}{\partial x_2} = 0 &\implies x_2^* = -\frac{1}{2\lambda}. \end{align*}$$

Plugging these in, we reduce the Lagrangian to a function of $\lambda$, the dual objective

$$d(\lambda) = \mathcal{L}(x^*(\lambda), \lambda) = -\frac{1}{2\lambda} - \lambda\epsilon,$$

which is concave for $\lambda > 0$, so we can maximize it by solving $\frac{\mathrm{d}\,d(\lambda)}{\mathrm{d}\lambda} = 0$ to yield the solution

$$\lambda^* = \frac{1}{\sqrt{2\epsilon}}.$$

So for this particular problem, we find that the penalized objective with $\lambda^* = \frac{1}{\sqrt{2\epsilon}}$ has the same solution as the constrained objective with bound $\epsilon$. Conversely, the penalized objective $f(x) + \lambda \phi(x)$ would have the same solution as the constrained objective with bound $\epsilon = \frac{1}{2\lambda^2}$, obtained by inverting the previous relation.
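We can spot-check this relationship numerically (a sketch; the value $\epsilon = 0.5$ and the optimizer are arbitrary choices): minimizing the penalized objective with $\lambda^* = 1/\sqrt{2\epsilon}$ should land exactly on the circle $\phi(x) = \epsilon$, at the constrained minimizer $x_1 = x_2 = -\sqrt{\epsilon/2}$.

```python
import numpy as np
from scipy.optimize import minimize

eps = 0.5
lam = 1.0 / np.sqrt(2 * eps)  # the lambda* derived above

# Penalized toy objective f(x) + lam * phi(x).
obj = lambda x: (x[0] + x[1]) + lam * (x[0] ** 2 + x[1] ** 2)
x_star = minimize(obj, np.zeros(2)).x

print(x_star)                                       # approx [-0.5, -0.5]
print(np.isclose(x_star @ x_star, eps, atol=1e-4))  # constraint saturated: True
```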

General case

We can in general define the dual objective

$$d(\lambda) = \min_x \mathcal{L}(x, \lambda) = f(x^*) + \lambda\left(\phi(x^*) - \epsilon\right)$$

where $x^*(\lambda)$ is the location of the optimum for a given $\lambda$. Solving for the $\lambda$ where $\nabla_\lambda d(\lambda)$ vanishes will give some relationship between $\lambda$ and $\epsilon$ for a particular problem. In general we have

$$\nabla_\lambda d(\lambda) = \nabla_\lambda f(x^*) + \lambda \nabla_\lambda \phi(x^*) + \left(\phi(x^*) - \epsilon\right)$$

where, by the chain rule,

$$\begin{align*} \nabla_\lambda f(x^*) &= \left(\partial_\lambda x^*\right)^\top \nabla f(x^*)\\ \nabla_\lambda \phi(x^*) &= \left(\partial_\lambda x^*\right)^\top \nabla \phi(x^*), \end{align*}$$

in which the Jacobian $\partial_\lambda x^* = -\left(\nabla^2_x \mathcal{L}\right)^{-1}\nabla^2_{x,\lambda} \mathcal{L}$ can be worked out by implicit differentiation of the optimality condition $\nabla_x \mathcal{L}(x^*, \lambda) = 0$ that defines $x^*$, but actually turns out not to be needed! Why? Recall that at optimality the gradients are parallel, $\nabla_x f(x^*) = -\lambda \nabla_x \phi(x^*)$; in particular they remain parallel (and still sum to zero) if we apply the same linear transformation --- for example, $\left(\partial_\lambda x^*\right)^\top$ --- to both sides. So the first two terms cancel, and we're left with

$$\nabla_\lambda d(\lambda) = \phi(x^*(\lambda)) - \epsilon.$$
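This identity is easy to verify by finite differences on the toy problem above (a sketch; the step size and tolerance are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

eps, lam, h = 0.5, 0.8, 1e-5

def d(lam):
    # Dual objective d(lam) = min_x L(x, lam) for the toy problem.
    obj = lambda x: (x[0] + x[1]) + lam * (x[0] ** 2 + x[1] ** 2 - eps)
    res = minimize(obj, np.zeros(2))
    return res.fun, res.x

# Central finite-difference estimate of d'(lam)...
fd_grad = (d(lam + h)[0] - d(lam - h)[0]) / (2 * h)
# ...should match phi(x*(lam)) - eps.
_, x_star = d(lam)
print(np.isclose(fd_grad, x_star @ x_star - eps, atol=1e-4))  # expect: True
```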

Setting this to zero, we see that if things are sufficiently 'nice', then $\phi(x^*(\lambda^*)) = \epsilon$ and correspondingly $\lambda^* = \lambda(\phi^{-1}(\epsilon))$, where:

  • $\phi^{-1}$ is the function that maps the constraint bound $\epsilon$ to the $x^*$ that saturates that bound
  • $\lambda(x^*)$ is the inverse of $x^*(\lambda)$; it gives us the $\lambda$ that would produce a specific value of $x^*$.

This doesn't seem super helpful? It's a very implicit definition.

I want to be able to say something like:

  • the penalty weight is a scaling term relating the gradients of the objective and the constraint at the optimum: by stationarity, $\lambda = \|\nabla f(x^*)\| / \|\nabla \phi(x^*)\|$
  • the constraint bound is the value of the constraint at the optimum: $\epsilon = \phi(x^*)$
  • so their relationship depends on the relative strengths of the objective and the constraint at the point where the optimum is achieved