Legendre transform: Nonlinear Function
Created: July 08, 2022
Modified: July 15, 2022

Legendre transform

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.


The Legendre transform of a function $f(x)$ is defined to be

$$g(y) = \sup_x\, \langle y, x\rangle - f(x).$$

An alternative definition for convex functions is that two convex functions $f$ and $g$ are Legendre transforms of each other if their derivatives are inverses: $g' = (f')^{-1}$, or equivalently $g'(f'(x)) = x$. There's a nice geometric argument for this in (among other places) Jess Riedel's post. This definition is nice because it's symmetric; it's immediately clear how this gives rise to convex duality.
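As a quick numerical sanity check of the sup definition (a sketch, using the hypothetical example $f(x) = \frac{1}{2}x^2$, which is its own Legendre transform since $f'(x) = x$ is its own inverse):

```python
import numpy as np

def legendre(f, y, xs):
    """Numerically approximate g(y) = sup_x <y, x> - f(x) over a grid xs."""
    return np.max(y * xs - f(xs))

f = lambda x: 0.5 * x**2           # f'(x) = x, so (f')^{-1}(y) = y
xs = np.linspace(-10, 10, 100001)  # grid wide enough to contain the maximizer

# f(x) = x^2/2 is its own Legendre transform: g(y) = y^2/2.
for y in [-2.0, 0.0, 1.5, 3.0]:
    assert abs(legendre(f, y, xs) - 0.5 * y**2) < 1e-4
```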

What is the inverse function of a derivative? Well, the function $f'(x)$ gives us the derivative of $f$ at a point $x$, so its inverse $(f')^{-1}(y)$ must take a derivative $y$ and give back the point $x^*$ at which $f$ achieves that derivative (it's clear that this is uniquely defined only when $f$ is convex).

If that's what $g'$ is doing, then what is $g$ doing? Seen as the antiderivative

$$g(y) = \int_{0}^y g'(\tilde{y})\, d\tilde{y},$$

we could view $g(y)$ as the result of the following computation: for all slopes up to $y$, find the point $x$ at which $f$ has that slope, then sum those points.

This sounds really weird. I don't think I really understand it, but maybe a hint comes from the connection to physics: summing up increments corresponds to re-expressing a function as a dynamics, as a sequence of incremental changes. This connects to the fact that the Legendre transform converts a Lagrangian to a Hamiltonian.

I think it helps to connect back to the geometric view: $g'(y)$ is the point where $f$ achieves slope $y$, and $g(y)$ is (minus) the intercept of the tangent line at that point. Now as we change $y$ slightly, how much does the intercept change for a line with slope $y$ that passes through $(x, f(x))$? Essentially we are trying to move a lever with a fixed attachment at distance $x$: if $x$ is really far away, then even a small change in the inclination of the lever will have a large impact on its intercept! The value of $g$ is the result of all the incremental changes to the lever's intercept as we maneuver it to have slope $y$, and each of those changes has magnitude proportional to the distance of the attachment point for the corresponding value of $y$.
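The tangent-intercept picture can be checked directly (a sketch, using $f(x) = \frac{1}{4}x^4$, the example worked through next): the tangent line to $f$ at the point where the slope equals $y$ has intercept $f(x^*) - y x^*$, which is exactly $-g(y)$.

```python
f = lambda x: 0.25 * x**4
fprime = lambda x: x**3

def tangent_intercept(x):
    """y-intercept of the tangent line to f at x: f(x) - f'(x) * x."""
    return f(x) - fprime(x) * x

# The tangent line with slope y touches f at x* = y^{1/3}, and its
# intercept equals -g(y), where g(y) = (3/4) y^{4/3} in closed form.
for y in [0.5, 1.0, 2.0, 8.0]:
    x_star = y ** (1 / 3)
    g_y = 0.75 * y ** (4 / 3)
    assert abs(tangent_intercept(x_star) + g_y) < 1e-12
```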

Working through an example:

Say $f(x) = \frac{1}{4}x^4$, so $f'(x) = x^3$, which is inverted as $g'(y) = y^{1/3}$. Then

$$g(y) = \int_{0}^y \tilde{y}^{1/3}\, d\tilde{y} = \frac{3}{4}\,y^{4/3}.$$

Compare this to the official definition

$$\begin{align*} g(y) &= \sup_{x}\, yx - f(x) \\ &= \Big[yx - f(x)\Big]_{x = (f')^{-1}(y)}\\ &= y \cdot y^{1/3} - f(y^{1/3})\\ &= y^{4/3} - \frac{1}{4}y^{4/3}\\ &= \frac{3}{4}y^{4/3}. \end{align*}$$
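The integral and sup computations can be checked against each other numerically (a sketch; the grid and its bounds are arbitrary choices):

```python
import numpy as np

f = lambda x: 0.25 * x**4
xs = np.linspace(-5, 5, 200001)

for y in [0.1, 1.0, 3.0]:
    sup_val = np.max(y * xs - f(xs))   # g(y) via the sup definition
    closed_form = 0.75 * y ** (4 / 3)  # g(y) = (3/4) y^{4/3}
    assert abs(sup_val - closed_form) < 1e-4
```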

What does the 'sup' in the definition correspond to? It's the general encoding of 'just evaluate at the inverse derivative'. Notably, that inverse derivative $x^*$ ends up also being the derivative of $g$:

$$\frac{d}{dy} g(y) = x^*(y) = \arg\max_x\, xy - f(x).$$

To see this, we compute

$$\begin{align*} \frac{d}{dy} g &= \frac{d}{dy} \left(y \cdot x^*(y) - f(x^*(y))\right)\\ &= x^*(y) + y \frac{dx^*}{dy} - \frac{df(x^*)}{dx^*}\frac{dx^*}{dy}, \end{align*}$$

where the last two terms cancel: since $x^* = (f')^{-1}(y)$, we have

$$\frac{df(x^*)}{dx^*} = f'(x^*) = f'\!\left((f')^{-1}(y)\right) = y,$$

so

$$y \frac{dx^*}{dy} - \frac{df(x^*)}{dx^*}\frac{dx^*}{dy} = (y - y)\frac{dx^*}{dy} = 0,$$

leaving $\frac{d}{dy} g = x^*(y)$. (Note that we never needed the value of $\frac{dx^*}{dy}$ itself.)
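The two identities above, $g'(y) = x^*(y)$ and $x^*(y) = \arg\max_x\, xy - f(x)$, can both be checked numerically (a sketch, again with $f(x) = \frac{1}{4}x^4$, so $x^*(y) = y^{1/3}$ and $g(y) = \frac{3}{4}y^{4/3}$):

```python
import numpy as np

f = lambda x: 0.25 * x**4
g = lambda y: 0.75 * y ** (4 / 3)
x_star = lambda y: y ** (1 / 3)

xs = np.linspace(-5, 5, 200001)
h = 1e-6
for y in [0.5, 1.0, 4.0]:
    # g'(y) via central finite difference should equal x*(y):
    dg_dy = (g(y + h) - g(y - h)) / (2 * h)
    assert abs(dg_dy - x_star(y)) < 1e-6
    # the maximizer of xy - f(x), found by grid search, should also be x*(y):
    maximizer = xs[np.argmax(y * xs - f(xs))]
    assert abs(maximizer - x_star(y)) < 1e-3
```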

In optimization

What is the relationship between minimax duality, duality in constrained optimization, and the duality of the Legendre transform?

For an optimization problem with objective $f$ and constraint $h$, we write the Lagrangian as $L(x, \lambda) = f(x) - \lambda h(x)$. Now minimizing this with respect to $x$ is (up to sign) maximizing $\lambda h(x) - f(x)$ with respect to $x$, so given a $\lambda$ we can write $d(\lambda) = \max_{x}\, \lambda h(x) - f(x)$ as the optimization dual function, the thing that emerges from minimax duality when we put the constraints on the 'outside' rather than the 'inside' of the optimization. If the constraint $h(x)$ is linear in $x$ then we have (waves hands) $d(\lambda) = \max_x\, \lambda x - f(x)$, which also just happens to be the Legendre transform of the objective $f(x)$, in which the conjugate variable $\lambda$ turns out to also be the Lagrange multiplier on the constraint.
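A concrete instance of this (a sketch with hypothetical choices $f(x) = x^2$ and the linear constraint function $h(x) = x$): the dual function is $\max_x\, \lambda x - x^2 = \lambda^2/4$, which is the Legendre transform of $f$ evaluated at $\lambda$.

```python
import numpy as np

f = lambda x: x**2
xs = np.linspace(-10, 10, 100001)

for lam in [-3.0, 0.0, 1.0, 4.0]:
    d_lam = np.max(lam * xs - f(xs))   # dual function via max over x
    assert abs(d_lam - lam**2 / 4) < 1e-4
```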