Modified: April 29, 2023
Lagrange multiplier
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

We're given a constrained optimization problem (note that the standard formulation of Lagrange multipliers handles only equality constraints; while inequality constraints could be handled by constraint reduction, the Karush-Kuhn-Tucker conditions provide a generalization that handles them more naturally), as

$$\min_x f(x) \quad \text{subject to} \quad g(x) = 0.$$
A key thing to notice is that at any solution to this problem, the gradients $\nabla f(x)$ and $\nabla g(x)$ of the objective and constraint functions (respectively) must be parallel to each other. Why? Suppose we can make a small move $\delta x$ that locally preserves the constraint by moving perpendicular to the gradient of the constraint (i.e., within its tangent plane). If any such move can decrease $f$, we should do it! We only reach a candidate solution when no more such moves are possible: at this point, any $\delta x$ that is perpendicular to the constraint gradient is also perpendicular to the objective gradient. It follows that the constraint and objective gradients must be parallel; equivalently, there exists some scalar $\lambda$ such that

$$\nabla f(x) + \lambda \nabla g(x) = 0.$$
This $\lambda$ is exactly the Lagrange multiplier. We can interpret it as the 'price' of the constraint: it tells us how much we could decrease the objective if we were willing to move in the direction of increasing the constraint.
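To see why the 'price' interpretation follows from the parallel-gradient condition (a quick sketch in the notation introduced above): if we take a small step $\delta x$ that increases the constraint value by $\epsilon$, i.e. $\nabla g(x) \cdot \delta x = \epsilon$, then

$$\nabla f(x) \cdot \delta x = -\lambda\, \nabla g(x) \cdot \delta x = -\lambda \epsilon,$$

so to first order the objective changes by $-\lambda \epsilon$.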
Lagrangian formulation
Our general approach is to extend the objective $f(x)$ into a new function $\mathcal{L}(x, \lambda)$, the 'Lagrangian', that incorporates the constraint with weight $\lambda$,

$$\mathcal{L}(x, \lambda) = f(x) + \lambda\, g(x).$$
Finding a stationary point of the Lagrangian corresponds to imposing the two conditions that the constraints are satisfied,

$$\nabla_\lambda \mathcal{L}(x, \lambda) = g(x) = 0,$$

and that the gradients are aligned (as discussed above),

$$\nabla_x \mathcal{L}(x, \lambda) = \nabla f(x) + \lambda \nabla g(x) = 0,$$

so that no local change to $x$ can further improve $f$ without disturbing $g$.
When things are nice, we can sometimes just write out these conditions as a system of equations and solve for the stationary points in closed form.
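As a toy worked example (my own, purely for illustration): minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$. The Lagrangian is

$$\mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1),$$

and setting its gradient to zero gives $2x + \lambda = 0$, $2y + \lambda = 0$, and $x + y - 1 = 0$. The first two equations give $x = y = -\lambda/2$; substituting into the constraint gives $\lambda = -1$ and $x = y = 1/2$.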
More generally, we can use numerical optimization. Because the solutions are, in general, saddle points of $\mathcal{L}$ (TODO: is there an argument for why this is? Trivially, because $\mathcal{L}$ is linear in $\lambda$ we will always have $\nabla^2_{\lambda\lambda} \mathcal{L} = 0$, so a stationary point can never be a strict local minimum or maximum of $\mathcal{L}$. But I feel like there's more intuition to be had here.), we can't just use methods designed to find local minima.
Note that we can always convert a saddle-point problem to a minimization problem in terms of squared gradients; that is, instead of searching for points where $\nabla \mathcal{L}(x, \lambda) = 0$, we search for points that minimize the squared gradient norm $\|\nabla \mathcal{L}(x, \lambda)\|^2$ (note that this will usually require second-order derivatives). The next section considers an alternative approach.
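Here is a minimal sketch of that idea, using JAX autodiff on the toy problem above; the problem, function names, and hyperparameters are my own illustrative choices, not from any reference.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Toy objective: f(x) = x1^2 + x2^2.
    return x[0] ** 2 + x[1] ** 2

def g(x):
    # Toy equality constraint: g(x) = x1 + x2 - 1 = 0.
    return x[0] + x[1] - 1.0

def lagrangian(params):
    # params packs (x1, x2, lambda) into a single vector.
    x, lam = params[:2], params[2]
    return f(x) + lam * g(x)

def grad_norm_sq(params):
    # Squared norm of the Lagrangian's gradient in (x, lambda).
    return jnp.sum(jax.grad(lagrangian)(params) ** 2)

# Plain gradient descent on the squared gradient norm. Each step
# differentiates through jax.grad(lagrangian), so it implicitly uses
# second derivatives of the Lagrangian.
params = jnp.zeros(3)  # (x1, x2, lambda)
for _ in range(2000):
    params = params - 0.05 * jax.grad(grad_norm_sq)(params)

print(params)  # expected to approach (0.5, 0.5, -1.0)
```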
TODO: connections to physics more generally. We can always convert a Lagrangian to a Hamiltonian by the Legendre transform, which connects to convex duality.
The dual formulation
We can view the constrained optimization problem as a bi-level optimization of the Lagrangian,

$$\min_x \max_\lambda \mathcal{L}(x, \lambda).$$

It's easy to see that the inner term $\max_\lambda \mathcal{L}(x, \lambda)$ is infinite if the constraint is violated, and equal to $f(x)$ otherwise. Thus the outer minimization sees a loss surface

$$p(x) = \max_\lambda \mathcal{L}(x, \lambda) = \begin{cases} f(x) & \text{if } g(x) = 0 \\ +\infty & \text{otherwise,}\end{cases}$$

equal to the original objective but with infinitely high walls where the constraints are violated. This view of the optimization problem as $\min_x \max_\lambda \mathcal{L}(x, \lambda)$ is called the 'primal' formulation. It's not usually super useful, due to the nondifferentiability of $p(x)$, but will serve as a starting point.
Following minimax duality, we can swap the order of optimization to obtain a lower bound on the objective,

$$\max_\lambda \min_x \mathcal{L}(x, \lambda) \;\le\; \min_x \max_\lambda \mathcal{L}(x, \lambda),$$

where the left hand side is the 'dual' formulation and its inner term

$$d(\lambda) = \min_x \mathcal{L}(x, \lambda)$$
is called the Lagrange dual function. For problems where this lower bound is tight, we say that strong duality holds. Slater's condition establishes that this is the case for most convex optimization problems.
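Continuing the toy example from above (again, my own example): the dual function is

$$d(\lambda) = \min_{x, y}\; x^2 + y^2 + \lambda (x + y - 1) = -\frac{\lambda^2}{2} - \lambda,$$

which is concave and maximized at $\lambda^* = -1$, where $d(\lambda^*) = \frac{1}{2}$. This matches the primal optimum $f(1/2, 1/2) = 1/2$, so the bound is tight and strong duality holds here.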
We can think about this as a two-player game, where the outer optimizer 'plays first' by choosing a value for $\lambda$, and the inner optimizer responds by finding the $x$ that minimizes the corresponding regularized objective. And indeed, we can operationalize this as an algorithm, dual gradient ascent (sketched in code after the list below):
- Start with any initial $\lambda$.
- Solve for $x^*(\lambda) = \arg\min_x \mathcal{L}(x, \lambda)$ using the unconstrained optimization method of your choice (e.g., run gradient descent, or even just do closed-form calculus).
- Now update $\lambda$ following the gradient of the dual function,
  $$\frac{d}{d\lambda} d(\lambda) = \frac{d}{d\lambda} \mathcal{L}(x^*(\lambda), \lambda) = \frac{\partial \mathcal{L}}{\partial x} \frac{d x^*}{d\lambda} + \frac{\partial \mathcal{L}}{\partial \lambda} = g(x^*(\lambda)),$$
  where we use the fact that (temporarily reverting to scalar notation for clarity) $\frac{\partial \mathcal{L}}{\partial x}\big|_{x = x^*(\lambda)} = 0$, since $x^*(\lambda)$ optimizes $\mathcal{L}(x, \lambda)$.
- Repeat steps 2-3 until convergence.
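Here is a minimal sketch of dual gradient ascent, again in JAX on the same toy problem; the inner solver, step sizes, and iteration counts are my own illustrative choices.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 + x[1] ** 2

def g(x):
    return x[0] + x[1] - 1.0

def lagrangian(x, lam):
    return f(x) + lam * g(x)

def solve_inner(lam, steps=200, lr=0.1):
    # Step 2: approximately minimize L(x, lam) over x by gradient descent.
    x = jnp.zeros(2)
    grad_x = jax.grad(lagrangian, argnums=0)
    for _ in range(steps):
        x = x - lr * grad_x(x, lam)
    return x

lam = 0.0
for _ in range(100):
    x_star = solve_inner(lam)
    # Step 3: the dual gradient is just the constraint value at x*(lam).
    lam = lam + 0.5 * g(x_star)

print(x_star, lam)  # expected to approach (0.5, 0.5) and -1.0
```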
Note that $d(\lambda)$ is always concave in $\lambda$ (regardless of the form of $f$ and $g$) because the Lagrangian for any given value of $x$ is just a linear function in $\lambda$, and $d$ is the pointwise minimum of these functions. So following its gradient is guaranteed to converge to an optimum of the dual function $d(\lambda)$.
When is this not sufficient to