How do Lagrange multipliers work to find the lowest value of a function subject to a constraint?

This type of problem is generally referred to as constrained optimization. A general technique for solving many such problems is known as the method of Lagrange multipliers; here is an example worked with Lagrange multipliers, followed by a short justification of why the technique works.

Consider the paraboloid given by $f(x,y) = x^2 + y^2$. The global minimum of this surface lies at the origin (at $x=0$, $y=0$). If we are given a constraint, a required relationship between $x$ and $y$, say $3x+y=6$, then the origin can no longer be our solution (since $3\cdot 0 + 1 \cdot 0 \neq 6$). Yet there is still a lowest point on this surface among the points satisfying the constraint.

What we have so far:

Objective function: $f(x,y) = x^2 + y^2$,
subject to: $3x+y=6$.

From here we can derive the Lagrange formulation of our constrained minimization problem. This will be a function $L$ of $x$, $y$, and a single Lagrange multiplier $\lambda$ (since we have only a single constraint). It is the stationary points of this new function that we will look for.

$L(x,y,\lambda) = x^2 + y^2 + \lambda(3x+y-6)$

The Lagrange formulation incorporates our original function along with our constraint(s). At a stationary point of $L$, the objective function $x^2 + y^2$ is balanced against the contribution from the constraint, which is now weighted by a factor of $\lambda$. If the constraint is met, then the expression $3x+y-6$ is necessarily zero and contributes nothing to the value of $L$. This is the trick of the technique.

Finding the stationary points of the Lagrange formulation:

To find them we look for the $x$, $y$, and $\lambda$ values that make the gradient of $L$ zero. (This is exactly analogous to setting the first derivative to zero in single-variable calculus; note that the solution is a stationary point of $L$, not necessarily a minimum of $L$ itself.)

$\nabla L = 0:$

$\frac{\partial L}{\partial x} = 2x + 3 \lambda = 0$

$\frac{\partial L}{\partial y} = 2y + \lambda = 0$

$\frac{\partial L}{\partial \lambda} = 3x + y - 6 = 0$

In our example we have arrived at a system of simultaneous linear equations, which can (and should) be solved with matrix algebra, as in the sketch below. The solution is a vector holding the values of $x$, $y$, and $\lambda$. The lowest value of the objective function, subject to the given constraint, sits at $(x,y,f(x,y))$; the Lagrange multiplier itself does not have an immediate physical interpretation here. (The multipliers do have meaning in certain contexts; more info on that here.)
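In matrix form the system reads $\begin{pmatrix} 2 & 0 & 3 \\ 0 & 2 & 1 \\ 3 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ \lambda \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 6 \end{pmatrix}$. Here is a minimal sketch of the solve in Python with NumPy (my choice of tooling, not something prescribed by the method):

```python
import numpy as np

# Coefficient matrix for the stationarity system:
#   2x      + 3*lam = 0
#        2y +   lam = 0
#   3x +  y         = 6
A = np.array([[2.0, 0.0, 3.0],
              [0.0, 2.0, 1.0],
              [3.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 6.0])

x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)      # 1.8 0.6 -1.2, i.e. (9/5, 3/5, -6/5)
print(x**2 + y**2)    # 3.6, the constrained minimum of f
print(3*x + y)        # 6.0, so the constraint is satisfied
```

So the constrained minimum is $f(9/5, 3/5) = 18/5$, attained where the line $3x+y=6$ passes closest to the origin.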


This elaborates on Mr. Bulatov's answer (note: Mr. Bulatov's link is broken, but I infer it is supposed to link to Dan Klein's notes Lagrange Multipliers without Permanent Scarring, which are indeed excellent).

Anyway, for the sake of simplicity let's work with functions in two variables, i.e. we want to find critical points of $f(x,y)$ where the constraint is $g(x,y)=c$.

Take any point $(a,b)$ in the $xy$-plane such that $g(a,b)=c$. Suppose there were a unit vector $\vec v$ tangent to the curve $g(x,y)=c$ at $(a,b)$ whose dot product with the gradient vector of $f$ at $(a,b)$ was non-zero. We may assume this dot product is positive, since otherwise we can replace $\vec v$ with $-\vec v$ (as $\nabla f(a,b)\cdot (-\vec v)=-(\nabla f(a,b)\cdot \vec v)$).

Now what this means is that the directional derivative of $f$ in the direction $\vec v$ is positive, so points in the direction $\vec v$ close enough to $(a,b)$ have higher $f$-value. I (and Mr. Bulatov) claim that this means there are points on $g(x,y)=c$ close to $(a,b)$ with higher $f$-value, so that $f(a,b)$ is not a maximum. Mr. Bulatov phrases the argument well in terms of infinitesimals, but I feel that a little more rigor is illuminating.

Consider a sequence of points $(a_i,b_i)$ on $g(x,y)=c$ that converge to $(a,b)$. Each point $(a_i,b_i)$ comes with a unit vector $\vec u_i$ that points in the direction of (is a positive scalar multiple of) $(a_i,b_i)-(a,b)$. Because $\vec v$ is a tangent vector, we can choose the $(a_i,b_i)$ so that the $\vec u_i$ converge to $\vec v$. Since $\nabla f(a,b)$ is a fixed vector, the directional derivatives $\nabla f(a,b)\cdot \vec u_i$ converge to $\nabla f(a,b)\cdot \vec v>0$, so for all sufficiently large $i$ the directional derivative in the direction $\vec u_i$ is positive, and hence $f(a_i,b_i)>f(a,b)$.

Therefore, if $f(a,b)$ were a maximum (or a minimum) of $f$ on $g(x,y)=c$, the gradient vector of $f$ at $(a,b)$ would have to have dot product zero with every vector in the line tangent to $g(x,y)=c$ at $(a,b)$. That means the gradient of $f$ is orthogonal to the tangent line of the constraint curve; since the gradient of $g$ is itself orthogonal to that tangent line, the gradient of $f$ must be parallel to (i.e. a scalar multiple of) the gradient of $g$.

From here we obtain the equations $\nabla f(x,y)=\lambda\nabla g(x,y)$ and $g(x,y)=c$, which are satisfied exactly when $(x,y,\lambda)$ is a critical point of the Lagrangian $f(x,y)-\lambda (g(x,y)-c)$.
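To connect this back to the worked example in the first answer, here is a small sketch (SymPy is my own tooling choice here, not something from either answer) that solves $\nabla f = \lambda \nabla g$ together with $g(x,y)=c$ for $f(x,y)=x^2+y^2$, $g(x,y)=3x+y$, $c=6$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 + y**2   # objective
g = 3*x + y       # constraint function, with the constraint g = 6

# Lagrange conditions: grad f = lam * grad g, together with g = 6
eqs = [sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),
       sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),
       sp.Eq(g, 6)]

print(sp.solve(eqs, [x, y, lam], dict=True))
# [{lam: 6/5, x: 9/5, y: 3/5}]
```

Note that $\lambda$ comes out with the opposite sign from the first answer's $\lambda=-6/5$: the convention $\nabla f=\lambda\nabla g$ corresponds to the Lagrangian $f-\lambda(g-c)$, while the first answer used $f+\lambda(g-c)$.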


You probably figured it out already, but for posterity -- this document gives intuition that I didn't see in Boyd's book, especially Figure 3 on page 3.

To solve an equality-constrained problem, you look for points where the gradient of the objective is in the same direction as the gradient of the constraint function. Why those points? Well, suppose we have a feasible point that is a maximum of the objective, but the gradients of the objective and constraint functions are not collinear. Then the objective's gradient has a non-zero component orthogonal to the constraint's gradient, and you can take an infinitesimal step in that direction to increase the objective. And because you are moving orthogonally to the gradient of the constraint function, your constraint is still satisfied. Hence you get a feasible point with a larger objective value, a contradiction.
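As a quick numerical illustration of this argument in the running example (the sketch below is my own, not from the linked document): at a non-optimal feasible point the objective's gradient has a non-zero component along the constraint line, while at the constrained optimum that component vanishes, i.e. $\nabla f$ and $\nabla g$ are collinear there.

```python
import numpy as np

grad_f = lambda p: np.array([2*p[0], 2*p[1]])  # gradient of f(x,y) = x^2 + y^2
grad_g = np.array([3.0, 1.0])                  # gradient of g(x,y) = 3x + y (constant)

# A unit vector orthogonal to grad_g: stepping along it keeps the
# constraint 3x + y = 6 satisfied, since the constraint set is a line.
tangent = np.array([-grad_g[1], grad_g[0]])
tangent = tangent / np.linalg.norm(tangent)

for p in [np.array([2.0, 0.0]),    # feasible (3*2 + 0 = 6) but not optimal
          np.array([9/5, 3/5])]:   # the constrained minimum found earlier
    # A non-zero component of grad_f along the constraint means sliding
    # along the constraint still changes f; zero means grad_f is
    # collinear with grad_g and we have a candidate optimum.
    print(p, grad_f(p) @ tangent)
# [2. 0.]   -1.2649...  (can still improve by sliding along the line)
# [1.8 0.6]  0.0        (gradients collinear: no feasible improvement)
```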