Created: July 07, 2022
Modified: July 15, 2022

penalties are constraints

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We often see optimization problems with objectives of the form

$$L(x) = f(x) + \beta\phi(x)$$

where $f(x)$ is the main function of interest (e.g., training loss in machine learning) and $\phi(x)$ is a nonnegative penalty or regularization term, e.g. $\phi(x) = \|x\|^2$, to encourage 'smaller' or 'simpler' values of $x$.
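For concreteness, here's a minimal Python sketch of such an objective (the least-squares loss and the data shapes are illustrative choices, not from any particular source):

```python
import numpy as np

# Penalized objective L(x) = f(x) + beta * phi(x), with f a least-squares
# training loss and phi(x) = ||x||^2 a ridge-style penalty.
def f(x, A, b):
    return 0.5 * np.sum((A @ x - b) ** 2)

def phi(x):
    return np.sum(x ** 2)

def penalized_objective(x, A, b, beta):
    return f(x, A, b) + beta * phi(x)
```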

This is equivalent to a constrained optimization problem

$$\begin{align*} \min_x\ & f(x)\\ \text{s.t. } & \phi(x) \le \epsilon \end{align*}$$

for some value of $\epsilon$. The basic intuition is that we can reconstruct the original objective from the Lagrangian of this constrained optimization problem. In particular, write

$$\mathcal{L}(x, \lambda) = f(x) + \lambda \left(\phi(x) - \epsilon\right)$$

The solution corresponds to some critical point $x^*, \lambda^*$ of the Lagrangian. Suppose we knew the appropriate $\lambda^*$. Then finding $x^*$ is just a matter of finding a critical point of

$$\mathcal{L}(x, \lambda^*) = f(x) + \lambda^* \phi(x) - \lambda^*\epsilon$$

where we can drop the term $\lambda^*\epsilon$, since it is constant with respect to $x$. What remains is exactly the penalized objective, with $\beta = \lambda^*$.
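As a quick numeric sanity check of this equivalence (a sketch with toy data and tolerances of my own choosing): minimize the penalized objective for some fixed $\lambda$, read off the bound $\epsilon = \phi(x^*)$ that it implicitly enforces, and confirm that the constrained problem with that bound recovers the same solution.

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 3)), rng.normal(size=10)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)  # toy convex objective
phi = lambda x: np.sum(x ** 2)                # quadratic penalty

lam = 0.7  # arbitrary penalty strength
x_pen = minimize(lambda x: f(x) + lam * phi(x), np.zeros(3)).x

# The constraint bound that this lambda implicitly enforces:
eps = phi(x_pen)
x_con = minimize(f, np.zeros(3), method="trust-constr",
                 constraints=[NonlinearConstraint(phi, -np.inf, eps)]).x

print(np.allclose(x_pen, x_con, atol=1e-4))  # expect: True
```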

do we know the critical point is a saddle point? i.e., are we minimizing with respect to $x$? The Karush-Kuhn-Tucker conditions tell us that if strong duality holds, then $(x^*, \lambda^*)$ is indeed a saddle point: $x^*$ minimizes $\mathcal{L}(x, \lambda^*)$ over $x$, while $\lambda^*$ maximizes $\mathcal{L}(x^*, \lambda)$ over $\lambda \ge 0$.
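For reference, the KKT conditions for the constrained problem are

$$\begin{align*} \nabla f(x^*) + \lambda^* \nabla \phi(x^*) &= 0 && \text{(stationarity)}\\ \phi(x^*) &\le \epsilon && \text{(primal feasibility)}\\ \lambda^* &\ge 0 && \text{(dual feasibility)}\\ \lambda^*\left(\phi(x^*) - \epsilon\right) &= 0 && \text{(complementary slackness)}. \end{align*}$$

Complementary slackness says the penalty is active ($\lambda^* > 0$) only when the constraint is saturated, $\phi(x^*) = \epsilon$; that saturated regime is exactly where the penalized and constrained problems correspond nontrivially.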

Relationship between $\lambda$ and $\epsilon$

What is the optimal value $\lambda^*$? Before attempting to give a general answer, let's look at a simple example with a closed-form solution,

$$\begin{align*}\min_{x}\ f(x) &= x_1 + x_2\\ \text{s.t. } \phi(x) &= x_1^2 + x_2^2 = \epsilon\end{align*}$$

where we can construct the Lagrangian

$$\mathcal{L}(x, \lambda) = x_1 + x_2 + \lambda\left(x_1^2 + x_2^2 - \epsilon\right)$$

and solve for its critical points in xx,

$$\begin{align*} \frac{\partial\mathcal{L}}{\partial x_1} = 0 &\implies x_1^* = -\frac{1}{2\lambda}\\ \frac{\partial\mathcal{L}}{\partial x_2} = 0 &\implies x_2^* = -\frac{1}{2\lambda}. \end{align*}$$

Plugging these in, we reduce the Lagrangian to a function of $\lambda$, the dual objective

$$d(\lambda) = \mathcal{L}(x^*(\lambda), \lambda) = -\frac{1}{2\lambda} - \lambda\epsilon,$$

which is concave for $\lambda > 0$, so we can maximize it by solving $\frac{\mathrm{d}\,d(\lambda)}{\mathrm{d}\lambda} = 0$ to yield the solution

$$\lambda^* = \frac{1}{\sqrt{2\epsilon}}.$$

So for this particular problem, we find that the penalized objective with $\lambda^* = \frac{1}{\sqrt{2\epsilon}}$ has the same solution as the constrained objective with bound $\epsilon$. Conversely, the penalized objective $f(x) + \lambda \phi(x)$ would have the same solution as the constrained objective with bound $\epsilon = \frac{1}{2\lambda^2}$, obtained by inverting the previous relation.
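We can spot-check this relationship numerically (a sketch; the value $\epsilon = 0.5$ and the optimizer are arbitrary choices): minimizing the penalized objective with $\lambda^* = 1/\sqrt{2\epsilon}$ should land exactly on the circle $\phi(x) = \epsilon$, at the constrained minimizer $x_1 = x_2 = -\sqrt{\epsilon/2}$.

```python
import numpy as np
from scipy.optimize import minimize

eps = 0.5
lam = 1.0 / np.sqrt(2 * eps)  # the lambda* derived above

# Penalized toy objective f(x) + lam * phi(x).
obj = lambda x: (x[0] + x[1]) + lam * (x[0] ** 2 + x[1] ** 2)
x_star = minimize(obj, np.zeros(2)).x

print(x_star)                                       # approx [-0.5, -0.5]
print(np.isclose(x_star @ x_star, eps, atol=1e-4))  # constraint saturated: True
```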

General case

We can in general define the dual objective

$$d(\lambda) = \min_x \mathcal{L}(x, \lambda) = f(x^*) + \lambda\left(\phi(x^*) - \epsilon\right)$$

where $x^*(\lambda)$ is the location of the optimum for a given $\lambda$. Solving for the $\lambda$ where $\nabla_\lambda d(\lambda)$ vanishes will give some relationship between $\lambda$ and $\epsilon$ for a particular problem. In general we have

$$\nabla_\lambda d(\lambda) = \nabla_\lambda f(x^*) + \lambda \nabla_\lambda \phi(x^*) + \left(\phi(x^*) - \epsilon\right)$$

where, by the chain rule,

$$\begin{align*} \nabla_\lambda f(x^*) &= \left(\partial_\lambda x^*\right)^\top \nabla f(x^*)\\ \nabla_\lambda \phi(x^*) &= \left(\partial_\lambda x^*\right)^\top \nabla \phi(x^*), \end{align*}$$

in which the Jacobian $\partial_\lambda x^* = -\left(\nabla^2_x \mathcal{L}\right)^{-1}\nabla^2_{x,\lambda} \mathcal{L}$ can be worked out by implicit differentiation of the optimality condition $\nabla_x \mathcal{L}(x^*, \lambda) = 0$ that defines $x^*$, but actually turns out not to be needed! Why? Recall that at optimality the gradients are parallel, $\nabla_x f(x^*) = -\lambda \nabla_x \phi(x^*)$; in particular they remain parallel (and still sum to zero) if we apply the same linear transformation --- for example, $\left(\partial_\lambda x^*\right)^\top$ --- to both sides. So the first two terms cancel, and we're left with

$$\nabla_\lambda d(\lambda) = \phi(x^*(\lambda)) - \epsilon.$$
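This identity is easy to verify by finite differences on the toy problem above (a sketch; the step size and tolerance are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

eps, lam, h = 0.5, 0.8, 1e-5

def d(lam):
    # Dual objective d(lam) = min_x L(x, lam) for the toy problem.
    obj = lambda x: (x[0] + x[1]) + lam * (x[0] ** 2 + x[1] ** 2 - eps)
    res = minimize(obj, np.zeros(2))
    return res.fun, res.x

# Central finite-difference estimate of d'(lam)...
fd_grad = (d(lam + h)[0] - d(lam - h)[0]) / (2 * h)
# ...should match phi(x*(lam)) - eps.
_, x_star = d(lam)
print(np.isclose(fd_grad, x_star @ x_star - eps, atol=1e-4))  # expect: True
```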

Setting this to zero, we see that if things are sufficiently 'nice', then $\phi(x^*(\lambda^*)) = \epsilon$ and correspondingly $\lambda^* = \lambda(\phi^{-1}(\epsilon))$, where:

  • $\phi^{-1}$ is the function that maps the constraint bound $\epsilon$ to the $x^*$ that saturates that bound
  • $\lambda(x^*)$ is the inverse of $x^*(\lambda)$; it gives us the $\lambda$ that would produce a specific value of $x^*$.

This doesn't seem super helpful? It's a very implicit definition.

I want to be able to say something like:

  • the penalty weight is a scaling term relating the gradients of the objective and the constraint at the optimum: by stationarity, $\lambda = \|\nabla f(x^*)\| / \|\nabla \phi(x^*)\|$
  • the constraint bound is the value of the constraint at the optimum: $\epsilon = \phi(x^*)$
  • so their relationship depends on the relative strengths of the objective and the constraint at the point where the optimum is achieved