Lagrange multiplier: Nonlinear Function
Created: July 06, 2022
Modified: April 29, 2023

Lagrange multiplier

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We're given a constrained optimization problem (note that the standard formulation of Lagrange multipliers handles only equality constraints; while inequality constraints could be handled by constraint reduction, the Karush-Kuhn-Tucker conditions provide a generalization that handles inequality constraints more naturally):

\begin{align*}
\min_\mathbf{x}\ &f(\mathbf{x})\\
\text{s.t. } c(\mathbf{x}) &= \mathbf{0}
\end{align*}

A key thing to notice is that at any solution to this problem, the gradients $\nabla_\mathbf{x} f(\mathbf{x})$ and $\nabla_\mathbf{x} c(\mathbf{x})$ of the objective and constraint functions (respectively) must be parallel to each other. Why? Suppose we can make a small move $\mathbf{\epsilon}$ that locally preserves the constraint

c(\mathbf{x} + \mathbf{\epsilon}) \approx c(\mathbf{x}) + \langle \mathbf{\epsilon}, \nabla_\mathbf{x} c(\mathbf{x})\rangle = c(\mathbf{x}),

by moving perpendicular to the gradient of the constraint (i.e., within its tangent plane). If any such move can decrease $f$, we should do it! We only reach a candidate solution when no more such moves are possible: at this point, any $\mathbf{\epsilon}$ that is perpendicular to the constraint gradient is also perpendicular to the objective gradient. It follows that the constraint and objective gradients must be parallel; equivalently, there exists some scalar $\lambda$ such that

\nabla_\mathbf{x} f(\mathbf{x}) = -\lambda \nabla_\mathbf{x} c(\mathbf{x}).

This $\lambda$ is exactly the Lagrange multiplier. We can interpret it as the 'price' of the constraint: it tells us how much we could decrease the objective if we were willing to move in the direction of increasing the constraint.
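
For concreteness, here is a small worked example (my own illustration, not part of the original argument): minimize $f(x, y) = x + y$ on the unit circle $c(x, y) = x^2 + y^2 - 1 = 0$. The parallel-gradient condition pins down the candidates:

\nabla f = \begin{pmatrix}1\\1\end{pmatrix}, \qquad \nabla c = \begin{pmatrix}2x\\2y\end{pmatrix}, \qquad \nabla f = -\lambda \nabla c \implies x = y = -\frac{1}{2\lambda},

and substituting into $x^2 + y^2 = 1$ gives $x = y = \pm 1/\sqrt{2}$ with $\lambda = \mp 1/\sqrt{2}$; the minimum of $f$ is at $x = y = -1/\sqrt{2}$ (where $\lambda = +1/\sqrt{2}$).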

Lagrangian formulation

Our general approach is to extend the objective $f$ into a new function $\mathcal{L}$, the 'Lagrangian', that incorporates the constraint with weight $\lambda$,

\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda c(\mathbf{x}).

Finding a stationary point $\nabla \mathcal{L} = \mathbf{0}$ of the Lagrangian corresponds to imposing the two conditions that the constraint is satisfied,

\begin{align*}
\nabla_\lambda \mathcal{L}(\mathbf{x}, \lambda) = 0 \implies c(\mathbf{x}) = 0
\end{align*}

and that the gradients are aligned (as discussed above)

\begin{align*}
\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}, \lambda) = \mathbf{0} \implies \nabla_\mathbf{x} f(\mathbf{x}) = -\lambda \nabla_\mathbf{x} c(\mathbf{x})
\end{align*}

so that no local change to $\mathbf{x}$ can further improve $f$ without disturbing $c$.

When things are nice, we can sometimes just write out these conditions as a system of equations and solve for the stationary points in closed form.
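
Continuing the toy example above, here is a minimal sketch of doing exactly that symbolically (sympy assumed; the problem setup is my own illustration, not from these notes):

    import sympy as sp

    x, y, lam = sp.symbols("x y lam", real=True)
    f = x + y                      # objective
    c = x**2 + y**2 - 1            # equality constraint c(x, y) = 0
    L = f + lam * c                # Lagrangian

    # Stationary points: set the gradient w.r.t. (x, y, lambda) to zero and solve.
    stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
    print(stationary)
    # Two candidates, (x, y) = +/- (1/sqrt(2), 1/sqrt(2)); the one with negative
    # coordinates minimizes f, matching the closed-form answer above.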

More generally, we can use numerical optimization. Because the solutions are, in general, saddle points of $\mathcal{L}$ (TODO: is there an argument for why this is? Trivially, because $\mathcal{L}$ is linear in $\lambda$ we will always have $\frac{\partial^2 \mathcal{L}}{\partial \lambda^2} = 0$. But I feel like there's more intuition to be had here.), we can't just use methods designed to find local minima.

Note that we can always convert a saddle-point problem to a minimization problem in terms of squared gradients; that is, instead of searching for points where $\nabla \mathcal{L} = \mathbf{0}$, we search for points that minimize the gradient norm $\|\nabla \mathcal{L}\|^2$ (note that this will usually require second-order derivatives). The next section considers an alternative approach.
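
As a quick illustration of the squared-gradient trick, here is a sketch of mine on the same toy problem (jax assumed, since differentiating the gradient norm needs second-order derivatives):

    import jax
    import jax.numpy as jnp

    def lagrangian(z):
        # z packs (x1, x2, lambda); toy problem: f(x) = x1 + x2, c(x) = x1^2 + x2^2 - 1.
        x, lam = z[:2], z[2]
        return x[0] + x[1] + lam * (x[0] ** 2 + x[1] ** 2 - 1.0)

    # Squared norm of the Lagrangian's gradient: zero exactly at stationary points.
    def grad_norm_sq(z):
        return jnp.sum(jax.grad(lagrangian)(z) ** 2)

    z = jnp.array([0.5, 0.5, 0.5])   # initial (x1, x2, lambda)
    for _ in range(5000):
        # Plain gradient descent with a small fixed step; note this differentiates
        # through grad(L), i.e. it uses second-order information.
        z = z - 0.02 * jax.grad(grad_norm_sq)(z)
    print(z)  # approaches a stationary point of L (here |x1| = |x2| ~ 0.707)

One caveat: this finds some stationary point of $\mathcal{L}$, not necessarily the constrained minimum, so the candidates still need to be compared on $f$.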

TODO: connections to physics more generally. We can always convert a Lagrangian to a Hamiltonian by the Legendre transform, which connects to convex duality.

The dual formulation

We can view the constrained optimization problem as a bi-level optimization of the Lagrangian,

\min_\mathbf{x} \max_\lambda \mathcal{L}(\mathbf{x}, \lambda).

It's easy to see that the inner term $\max_\lambda \mathcal{L}(\mathbf{x}, \lambda)$ is infinite if the constraint is violated, and equal to $f(\mathbf{x})$ otherwise. Thus the outer minimization sees a loss surface

p(\mathbf{x}) = \max_\lambda \mathcal{L}(\mathbf{x}, \lambda) = \begin{cases} f(\mathbf{x}) & \text{if } c(\mathbf{x}) = 0\\ \infty & \text{otherwise} \end{cases}

equal to the original objective but with infinitely high walls where the constraints are violated. This view of the optimization problem as $\min_\mathbf{x} p(\mathbf{x})$ is called the 'primal' formulation. It's not usually super useful, due to the nondifferentiability of $p$, but will serve as a starting point.

By the max-min inequality (weak duality), we can swap the order of optimization to obtain a lower bound on the objective

\max_\lambda \min_\mathbf{x} \mathcal{L}(\mathbf{x}, \lambda) \le \min_\mathbf{x} \max_\lambda \mathcal{L}(\mathbf{x}, \lambda)

where the left hand side is the 'dual' formulation and its inner term

\begin{align*}
g(\lambda) = \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \lambda)
\end{align*}

is called the Lagrange dual function. For problems where this lower bound is tight, we say that strong duality holds. Slater's condition (roughly, the existence of a strictly feasible point) establishes that this is the case for most convex optimization problems.
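
For the toy example above (my own running illustration), the dual function comes out in closed form:

g(\lambda) = \min_{x, y}\left[x + y + \lambda(x^2 + y^2 - 1)\right] = \begin{cases} -\frac{1}{2\lambda} - \lambda & \text{if } \lambda > 0\\ -\infty & \text{otherwise,} \end{cases}

which is concave and maximized at $\lambda = 1/\sqrt{2}$, where $g(1/\sqrt{2}) = -\sqrt{2}$ matches the primal optimum, so strong duality holds here.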

We can think about this as a two-player game, where the outer optimizer 'plays first' by choosing a value for $\lambda$, and the inner optimizer responds by finding the $\mathbf{x}$ that minimizes the corresponding regularized objective. And indeed, we can operationalize this as an algorithm, dual gradient ascent (sketched in code after the steps below):

  1. Start with any initial $\lambda$.
  2. Solve for $\mathbf{x}^*(\lambda) = \arg\min_\mathbf{x} \mathcal{L}(\mathbf{x}, \lambda)$ using the unconstrained optimization method of your choice (e.g., run gradient descent, or even just do closed-form calculus).
  3. Now update $\lambda$ following the gradient of the dual function
    \begin{align*}
    \nabla_\lambda g(\lambda) &= \nabla_\lambda \mathcal{L}(x^*(\lambda), \lambda)\\
    &= \frac{\partial \mathcal{L}(x^*, \lambda)}{\partial \lambda} + \frac{\partial \mathcal{L}(x^*, \lambda)}{\partial x^*}\frac{\partial x^*}{\partial \lambda}\\
    &= \frac{\partial \mathcal{L}(x^*, \lambda)}{\partial \lambda}
    \end{align*}
    where we use the fact that $\frac{\partial \mathcal{L}(x^*, \lambda)}{\partial x^*} = 0$ (temporarily reverting to scalar notation for clarity) since $x^*$ optimizes $\mathcal{L}$. Concretely, this gradient is just the constraint value: $\nabla_\lambda g(\lambda) = c(x^*(\lambda))$.
  4. Repeat steps 2-3 until convergence.
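
Here is a rough sketch of these steps on the toy problem from earlier (my own example; numpy and scipy are assumed, with scipy.optimize.minimize standing in for 'the unconstrained optimization method of your choice'):

    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        return x[0] + x[1]                     # objective

    def c(x):
        return x[0] ** 2 + x[1] ** 2 - 1.0     # equality constraint c(x) = 0

    def lagrangian(x, lam):
        return f(x) + lam * c(x)

    lam = 1.0                                  # step 1: any initial multiplier
    x = np.zeros(2)
    eta = 0.5                                  # dual step size
    for _ in range(100):
        # Step 2: inner minimization over x at the current lambda (warm-started).
        # For this toy problem lambda stays positive, so the inner problem stays bounded.
        x = minimize(lagrangian, x, args=(lam,)).x
        # Step 3: ascend the dual; its gradient is the constraint value c(x*(lambda)).
        lam = lam + eta * c(x)

    print(x, lam)  # approaches x = (-1/sqrt(2), -1/sqrt(2)), lam = 1/sqrt(2)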

Note that $g$ is always concave in $\lambda$ (regardless of $f$) because the Lagrangian $\mathcal{L}(x, \cdot)$ for any given value of $x$ is just a linear function of $\lambda$, and $g$ is the pointwise minimum of these functions. So following its gradient (with a suitable step size) is guaranteed to converge to an optimum of the dual function $g$.

When is this not sufficient to
