Gradient Descent with constraints (Lagrange multipliers)

Probably too late to be helpful to the OP but may be useful to others in the same situation:

A problem with absolute-value constraints can often be reformulated into an equivalent problem that only has linear constraints, by adding a few "helper" variables.

For example, consider problem 1:

Find (x1,x2) that minimises f(x1,x2) subject to the nonlinear constraint |x1|+|x2|<=10.

There is a linear-constraint version, problem 2:

Find (x1,x2,x3,x4) that minimises f(x1,x2) subject to the following linear constraints:

  1. x1<=x3
  2. -x1<=x3
  3. x2<=x4
  4. -x2<=x4
  5. x3+x4<=10

Note:

  • If (x1,x2,x3,x4) satisfies the constraints for problem 2, then (x1,x2) satisfies the constraints for problem 1 (constraints 1-4 force x3 >= abs(x1) and x4 >= abs(x2), so abs(x1)+abs(x2) <= x3+x4 <= 10)
  • If (x1,x2) satisfies the constraints for problem 1, then we can extend it to (x1,x2,x3,x4) satisfying the constraints for problem 2 by setting x3=abs(x1), x4=abs(x2)
  • x3,x4 have no effect on the target function

It follows that finding an optimum for problem 2 will give you an optimum for problem 1, and vice versa.
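To make this concrete, here's a small sketch using SciPy's SLSQP solver. The quadratic objective is just an assumed example (the OP never gave one); the constraint matrix encodes constraints 1-5 above:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed example objective: f(x1, x2) = (x1 - 20)^2 + x2^2.
# Note the helper variables x3, x4 do not enter the objective.
def f(v):
    x1, x2, x3, x4 = v
    return (x1 - 20.0) ** 2 + x2 ** 2

# Constraints 1-5 written as A @ v <= b
A = np.array([
    [ 1,  0, -1,  0],   # 1.  x1 <= x3
    [-1,  0, -1,  0],   # 2. -x1 <= x3
    [ 0,  1,  0, -1],   # 3.  x2 <= x4
    [ 0, -1,  0, -1],   # 4. -x2 <= x4
    [ 0,  0,  1,  1],   # 5.  x3 + x4 <= 10
])
b = np.array([0.0, 0.0, 0.0, 0.0, 10.0])

# SLSQP's "ineq" convention is fun(v) >= 0, i.e. b - A @ v >= 0
res = minimize(f, x0=[0.0, 0.0, 0.0, 0.0], method="SLSQP",
               constraints={"type": "ineq", "fun": lambda v: b - A @ v})
# res.x[:2] should land near (10, 0), the point of the region
# abs(x1) + abs(x2) <= 10 closest to the unconstrained optimum (20, 0).
```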


The problem is that when using Lagrange multipliers, the critical points don't occur at local minima of the Lagrangian - they occur at saddle points instead. Since the gradient descent algorithm is designed to find local minima, it fails to converge when you give it a problem with constraints.

There are typically three solutions:

  • Use a numerical method which is capable of finding saddle points, e.g. Newton's method. These typically require analytical expressions for both the gradient and the Hessian, however.
  • Use penalty methods. Here you add an extra (smooth) term to your cost function, which is zero when the constraints are satisfied (or nearly satisfied) and very large when they are not satisfied. You can then run gradient descent as usual. However, this often has poor convergence properties, as it makes many small adjustments to ensure the parameters satisfy the constraints.
  • Instead of looking for critical points of the Lagrangian, minimize the squared norm of the gradient of the Lagrangian. Obviously, if all derivatives of the Lagrangian are zero, then the squared norm of the gradient will be zero, and since the square of something can never be less than zero, you will find the same solutions as you would by extremizing the Lagrangian. However, if you want to use gradient descent then you need an expression for the gradient of the squared norm of the gradient of the Lagrangian, which might not be easy to come by.
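To illustrate the penalty approach (the second option), here's a minimal sketch for an assumed toy problem: minimise (x-3)^2 subject to x <= 1, with a quadratic penalty on constraint violation:

```python
# Assumed toy problem: minimise f(x) = (x - 3)^2 subject to x <= 1.
# Penalised cost: (x - 3)^2 + mu * max(0, x - 1)^2, which is smooth
# and zero-penalty wherever the constraint holds.
def grad_penalized(x, mu):
    # d/dx [(x - 3)^2 + mu * max(0, x - 1)^2]
    return 2.0 * (x - 3.0) + 2.0 * mu * max(0.0, x - 1.0)

x, mu, lr = 0.0, 100.0, 1e-3
for _ in range(20000):
    x -= lr * grad_penalized(x, mu)
# x settles near 1.02 rather than exactly 1: a finite penalty weight
# only approximately enforces the constraint, which is why penalty
# methods typically re-solve with increasing mu.
```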

Personally, I would go with the third approach, and find the gradient of the square of the gradient of the Lagrangian numerically if it's too difficult to get an analytic expression for it.
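For example (again with an assumed toy problem: minimise x+y subject to x^2+y^2=1), gradient descent on the squared norm of the gradient of the Lagrangian, with the outer gradient taken numerically by central finite differences, might look like:

```python
import numpy as np

# Assumed toy problem: minimise f(x, y) = x + y subject to x^2 + y^2 = 1.
# Lagrangian: L = x + y + lam * (x^2 + y^2 - 1). Its critical points are
# the constrained extrema, but they are saddle points of L itself.
def grad_L(v):
    x, y, lam = v
    return np.array([1 + 2 * lam * x,
                     1 + 2 * lam * y,
                     x**2 + y**2 - 1])

def g(v):
    # Squared norm of the gradient of L: zero exactly at critical points of L
    return np.dot(grad_L(v), grad_L(v))

def num_grad(fun, v, h=1e-6):
    # Central finite differences, so no analytic second derivatives needed
    out = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        out[i] = (fun(v + e) - fun(v - e)) / (2 * h)
    return out

v = np.array([-0.5, -0.5, 0.5])  # arbitrary starting point
for _ in range(20000):
    v -= 0.01 * num_grad(g, v)
# v should approach a critical point of L on the unit circle
# (x = y = ±1/sqrt(2)), where g(v) is (near) zero.
```

One caveat with this trick: the starting point matters, since gradient descent on g can in principle stall at a nonzero local minimum of g.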

Also, you don't quite make it clear in your question - are you using gradient descent, or CG (conjugate gradients)?