Modified: April 29, 2023
Lagrange multiplier
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We're given a constrained optimization problem (note that the standard formulation of Lagrange multipliers handles only equality constraints; while inequality constraints could be handled by constraint reduction, the Karush-Kuhn-Tucker conditions provide a generalization that handles them more naturally), as

$$\min_x f(x) \quad \text{subject to} \quad g(x) = 0.$$
A key thing to notice is that at any solution to this problem, the gradients $\nabla f(x)$ and $\nabla g(x)$ of the objective and constraint functions (respectively) must be parallel to each other. Why? Suppose we can make a small move $\delta x$ that locally preserves the constraint by moving perpendicular to the gradient of the constraint (i.e., within its tangent plane). If any such move can decrease $f$, we should do it! We only reach a candidate solution when no more such moves are possible: at this point, any $\delta x$ that is perpendicular to the constraint gradient is also perpendicular to the objective gradient. It follows that the constraint and objective gradients must be parallel; equivalently, there exists some scalar $\lambda$ such that

$$\nabla f(x) + \lambda \nabla g(x) = 0.$$
This $\lambda$ is exactly the Lagrange multiplier. We can interpret it as the 'price' of the constraint: it tells us how much we could decrease the objective if we were willing to move in the direction of increasing the constraint.
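To see why the 'price' interpretation follows from the parallel-gradient condition (a quick sketch in the notation introduced above): if we take a small step $\delta x$ that increases the constraint value by $\epsilon$, i.e. $\nabla g(x) \cdot \delta x = \epsilon$, then

$$\nabla f(x) \cdot \delta x = -\lambda\, \nabla g(x) \cdot \delta x = -\lambda \epsilon,$$

so to first order the objective changes by $-\lambda \epsilon$.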
Lagrangian formulation
Our general approach is to extend the objective $f(x)$ into a new function $\mathcal{L}(x, \lambda)$, the 'Lagrangian', that incorporates the constraint with weight $\lambda$,

$$\mathcal{L}(x, \lambda) = f(x) + \lambda\, g(x).$$
Finding a stationary point of the Lagrangian corresponds to imposing the two conditions that the constraints are satisfied,

$$\nabla_\lambda \mathcal{L}(x, \lambda) = g(x) = 0,$$

and that the gradients are aligned (as discussed above),

$$\nabla_x \mathcal{L}(x, \lambda) = \nabla f(x) + \lambda \nabla g(x) = 0,$$

so that no local change to $x$ can further improve $f$ without disturbing $g$.
When things are nice, we can sometimes just write out these conditions as a system of equations and solve for the stationary points in closed form.
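As a toy worked example (my own, purely for illustration): minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$. The Lagrangian is

$$\mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1),$$

and setting its gradient to zero gives $2x + \lambda = 0$, $2y + \lambda = 0$, and $x + y - 1 = 0$. The first two equations give $x = y = -\lambda/2$; substituting into the constraint gives $\lambda = -1$ and $x = y = 1/2$.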
More generally, we can use numerical optimization. Because the solutions are, in general, saddle points of $\mathcal{L}$ (TODO: is there an argument for why this is? Trivially, because $\mathcal{L}$ is linear in $\lambda$ we will always have $\nabla^2_{\lambda\lambda} \mathcal{L} = 0$, so a stationary point can never be a strict local minimum or maximum of $\mathcal{L}$. But I feel like there's more intuition to be had here.), we can't just use methods designed to find local minima.
Note that we can always convert a saddle-point problem to a minimization problem in terms of squared gradients; that is, instead of searching for points where $\nabla \mathcal{L}(x, \lambda) = 0$, we search for points that minimize the squared gradient norm $\|\nabla \mathcal{L}(x, \lambda)\|^2$ (note that this will usually require second-order derivatives). The next section considers an alternative approach.
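Here is a minimal sketch of that idea, using JAX autodiff on the toy problem above; the problem, function names, and hyperparameters are my own illustrative choices, not from any reference.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Toy objective: f(x) = x1^2 + x2^2.
    return x[0] ** 2 + x[1] ** 2

def g(x):
    # Toy equality constraint: g(x) = x1 + x2 - 1 = 0.
    return x[0] + x[1] - 1.0

def lagrangian(params):
    # params packs (x1, x2, lambda) into a single vector.
    x, lam = params[:2], params[2]
    return f(x) + lam * g(x)

def grad_norm_sq(params):
    # Squared norm of the Lagrangian's gradient in (x, lambda).
    return jnp.sum(jax.grad(lagrangian)(params) ** 2)

# Plain gradient descent on the squared gradient norm. Each step
# differentiates through jax.grad(lagrangian), so it implicitly uses
# second derivatives of the Lagrangian.
params = jnp.zeros(3)  # (x1, x2, lambda)
for _ in range(2000):
    params = params - 0.05 * jax.grad(grad_norm_sq)(params)

print(params)  # expected to approach (0.5, 0.5, -1.0)
```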
TODO: connections to physics more generally. We can always convert a Lagrangian to a Hamiltonian by the Legendre transform, which connects to convex duality.
The dual formulation
We can view the constrained optimization problem as a bi-level optimization of the Lagrangian,

$$\min_x \max_\lambda \mathcal{L}(x, \lambda).$$

It's easy to see that the inner term $\max_\lambda \mathcal{L}(x, \lambda)$ is infinite if the constraint is violated, and equal to $f(x)$ otherwise. Thus the outer minimization sees a loss surface

$$p(x) = \max_\lambda \mathcal{L}(x, \lambda) = \begin{cases} f(x) & \text{if } g(x) = 0 \\ +\infty & \text{otherwise,}\end{cases}$$

equal to the original objective but with infinitely high walls where the constraints are violated. This view of the optimization problem as $\min_x \max_\lambda \mathcal{L}(x, \lambda)$ is called the 'primal' formulation. It's not usually super useful, due to the nondifferentiability of $p(x)$, but will serve as a starting point.
Following minimax duality, we can swap the order of optimization to obtain a lower bound on the objective,

$$\max_\lambda \min_x \mathcal{L}(x, \lambda) \;\le\; \min_x \max_\lambda \mathcal{L}(x, \lambda),$$

where the left hand side is the 'dual' formulation and its inner term

$$d(\lambda) = \min_x \mathcal{L}(x, \lambda)$$
is called the Lagrange dual function. For problems where this lower bound is tight, we say that strong duality holds. Slater's condition establishes that this is the case for most convex optimization problems.
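Continuing the toy example from above (again, my own example): the dual function is

$$d(\lambda) = \min_{x, y}\; x^2 + y^2 + \lambda (x + y - 1) = -\frac{\lambda^2}{2} - \lambda,$$

which is concave and maximized at $\lambda^* = -1$, where $d(\lambda^*) = \frac{1}{2}$. This matches the primal optimum $f(1/2, 1/2) = 1/2$, so the bound is tight and strong duality holds here.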
We can think about this as a two-player game, where the outer optimizer 'plays first' by choosing a value for $\lambda$, and the inner optimizer responds by finding the $x$ that minimizes the corresponding regularized objective. And indeed, we can operationalize this as an algorithm, dual gradient ascent (sketched in code after the list below):
- Start with any initial $\lambda$.
- Solve for $x^*(\lambda) = \arg\min_x \mathcal{L}(x, \lambda)$ using the unconstrained optimization method of your choice (e.g., run gradient descent, or even just do closed-form calculus).
- Now update $\lambda$ following the gradient of the dual function,
  $$\frac{d}{d\lambda} d(\lambda) = \frac{d}{d\lambda} \mathcal{L}(x^*(\lambda), \lambda) = \frac{\partial \mathcal{L}}{\partial x} \frac{d x^*}{d\lambda} + \frac{\partial \mathcal{L}}{\partial \lambda} = g(x^*(\lambda)),$$
  where we use the fact that (temporarily reverting to scalar notation for clarity) $\frac{\partial \mathcal{L}}{\partial x}\big|_{x = x^*(\lambda)} = 0$, since $x^*(\lambda)$ optimizes $\mathcal{L}(x, \lambda)$.
- Repeat steps 2-3 until convergence.
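Here is a minimal sketch of dual gradient ascent, again in JAX on the same toy problem; the inner solver, step sizes, and iteration counts are my own illustrative choices.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 + x[1] ** 2

def g(x):
    return x[0] + x[1] - 1.0

def lagrangian(x, lam):
    return f(x) + lam * g(x)

def solve_inner(lam, steps=200, lr=0.1):
    # Step 2: approximately minimize L(x, lam) over x by gradient descent.
    x = jnp.zeros(2)
    grad_x = jax.grad(lagrangian, argnums=0)
    for _ in range(steps):
        x = x - lr * grad_x(x, lam)
    return x

lam = 0.0
for _ in range(100):
    x_star = solve_inner(lam)
    # Step 3: the dual gradient is just the constraint value at x*(lam).
    lam = lam + 0.5 * g(x_star)

print(x_star, lam)  # expected to approach (0.5, 0.5) and -1.0
```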
Note that $d(\lambda)$ is always concave in $\lambda$ (regardless of the form of $f$ and $g$) because the Lagrangian for any given value of $x$ is just a linear function in $\lambda$, and $d$ is the pointwise minimum of these functions. So following its gradient is guaranteed to converge to an optimum of the dual function $d(\lambda)$.
When is this not sufficient to