How to minimize $f(x) = \|Ax-b\|$

\begin{eqnarray*} \left \lVert Ax-y\right\rVert^{2} &=& x^{T} A^{T} A x -2 y^{T} A x + y^{T} y \end{eqnarray*}

Setting the gradient with respect to $x$ to $0$,

\begin{eqnarray*} \nabla_{x} \left \lVert Ax-y\right\rVert^{2} &=& 2 A^{T} A x - 2A^{T} y =0 \end{eqnarray*}

yields the normal equations. Assuming $A^{T} A$ is invertible, the solution is the well-known least squares estimate:

\begin{equation*} \hat{x}_{LS} = \left(A^{T} A \right)^{-1} A^{T} y \end{equation*}
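As a quick numerical sanity check, here is a short NumPy sketch (the matrix sizes and random data are illustrative assumptions) that solves the normal equations directly and compares the result against a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # tall matrix, full column rank (almost surely)
y = rng.standard_normal(8)

# Solve the normal equations A^T A x = A^T y
x_ls = np.linalg.solve(A.T @ A, A.T @ y)

# Compare with the library least-squares solver
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_ls, x_ref))  # True
```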

Geometrically, the least squares solution can be viewed as the (orthogonal) projection of the observation $y$ onto the image of $A$.


Alternative:

In case you want to stick to $\lVert Ax - b \rVert = \mathrm{tr} \left[ \left(\left( Ax - b \right)^T \left( Ax - b \right) \right)^{\frac{1}{2}} \right]$, not the squared $\ell_2$ norm $\lVert Ax - b \rVert^2$ (even though the final solution is the same), here is an alternative.

We will utilize the following identities:

  • Trace and Frobenius product relation $$\left\langle A, B \right\rangle={\rm tr}(A^TB)$$ or $$\left\langle A^T, B \right\rangle ={\rm tr}(AB)$$
  • Cyclic property of Trace/Frobenius product \begin{align} \left\langle A, B C \right\rangle &= \left\langle AC^T, B \right\rangle \\ &= \left\langle B^T A, C \right\rangle \\ &= {\text{etc.}} \cr \end{align}
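These identities are easy to verify numerically; here is a small NumPy check (the matrix shapes and random data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((5, 4))

def frob(X, Y):
    """Frobenius inner product <X, Y> = tr(X^T Y)."""
    return np.trace(X.T @ Y)

# Cyclic property: <A, BC> = <A C^T, B> = <B^T A, C>
lhs = frob(A, B @ C)
print(np.isclose(lhs, frob(A @ C.T, B)))  # True
print(np.isclose(lhs, frob(B.T @ A, C)))  # True
```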

We obtain the differential first, and then the gradient.

Moreover, you can utilize the following trace differential result \begin{align} d\,{\rm tr}(f(M)) &= \left\langle f^\prime(M^T), \ dM \right\rangle , \end{align} where $M = \left( Ax - b \right)^T \left( Ax - b \right)$ in the considered case and $f(\cdot) = \sqrt{\cdot}$.

So, \begin{align} d\, \lVert Ax - b \rVert &= d\, \mathrm{tr} \left[ \left(\left( Ax - b \right)^T \left( Ax - b \right) \right)^{\frac{1}{2}} \right] \\ &= \left\langle \frac {1} {2} \underbrace{\left[ \left(\left( Ax - b \right)^T \left( Ax - b \right) \right)^{-\frac{1}{2}} \right] }_{= \frac{1}{\lVert Ax - b \rVert}} , \ d\left[ \left(\left( Ax - b \right)^T \left( Ax - b \right) \right) \right] \right\rangle \\ &= \left\langle \frac {1} {2 \lVert Ax - b \rVert}, \ dx^T A^T \left(Ax-b \right) + \left(Ax-b \right)^T A \ dx \right\rangle \\ &= \left\langle \frac {1} {2 \lVert Ax - b \rVert}, \ 2\left(Ax-b \right)^T A \ dx \right\rangle \\ &= \left\langle \frac {A^T \left(Ax-b \right)} {\lVert Ax - b \rVert}, \ \ dx \right\rangle \\ \end{align}

Thus the derivative is $$ \eqalign { \frac { \partial} {\partial x}\lVert Ax - b \rVert &= \frac {A^T \left(Ax-b \right)} {\lVert Ax - b \rVert} . \cr } $$

Now, you can set the derivative to $0$, and (again assuming $A^T A$ is invertible) the solution is the same as above, i.e., \begin{align} x = \left(A^T A \right)^{-1} A^T b \ . \end{align}
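The closed-form gradient $A^T(Ax-b)/\lVert Ax-b\rVert$ can be checked against central finite differences; a NumPy sketch (the random problem data is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x = rng.standard_normal(3)

f = lambda x: np.linalg.norm(A @ x - b)

# Analytic gradient: A^T (Ax - b) / ||Ax - b||
grad = A.T @ (A @ x - b) / f(x)

# Central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(grad, num, atol=1e-5))  # True
```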


To expand on one of the comments on your question, the norm minimization problem you are considering is a projection problem. Note that the projection of a vector $b$ onto a set $S$ is nothing but finding a point $x\in S$ such that the distance between $b$ and $x$ is minimized. Define the projection function as $\Pi_S:\mathbb R^m\to S$. The projection of $x$ onto $S$ is given by $\Pi_S(x)$: $$ \Pi_S(x)=\arg\min_{y\in S}\|x-y\|. $$ Now let $S$ be the range of $A$, that is, $$ S=\{Ax: x\in\mathbb R^n\}. $$ Then the problem of minimizing $\|b-Ax\|$, which is a least squares problem, can be understood as projecting $b$ onto the range of $A$, i.e., finding $\Pi_S(b)$.

The projection is given by a linear operator, say $P$. $P$ is symmetric, and moreover, by definition, the projection of each vector $y\in S$ onto $S$ is itself: $Py=y$. This means that $PA=A$. As an exercise, verify that the projection satisfies the intuitive idea that $Px-x$ should be orthogonal to each vector in $S$ (Hint: $A^T(I-P)=0$).

The projection matrix is unique (see why!) and therefore for each $b$ there is a unique point on the range of $A$ as the projection of $b$, namely $Pb$.

The projection onto the range of $A$ can be found using the Moore-Penrose inverse of $A$. The Moore-Penrose inverse $A^\dagger$ satisfies, among other properties, $$ AA^\dagger A=A,\qquad A^\dagger AA^\dagger=A^\dagger, $$ and $AA^\dagger$ is symmetric. With a little effort one can see that $P=AA^\dagger$. Therefore $z=AA^\dagger b$ gives you the projection of $b$ onto $S$. So we have found the point $z$ in the range of $A$ minimizing the distance to $b$. But what about finding $x$ such that $Ax=z$? In general there is no unique solution; however, we can see that $$ x=A^\dagger b\implies Ax=AA^\dagger b=z. $$ Hence, $x=A^\dagger b$ is a solution. Moreover, for any $v$ in the kernel of $A$, $x+v$ is also a solution. The solution is unique only when the kernel of $A$ contains nothing but the zero vector.

Then how can we find $A^\dagger$? If $A^TA$ is invertible, $A^\dagger=(A^TA)^{-1}A^T$, as you can see in the other answers. Otherwise, there are other methods, such as using the singular value decomposition, to find the Moore-Penrose inverse.
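The projection picture above can be checked numerically. A small NumPy sketch (the rank-deficient $A$ is a deliberately constructed example) verifying the Moore-Penrose properties, that $P=AA^\dagger$ acts as the projection, and that $x=A^\dagger b$ solves $Ax=Pb$:

```python
import numpy as np

rng = np.random.default_rng(3)
# Rank-deficient A: the third column is the sum of the first two
A = rng.standard_normal((6, 2))
A = np.hstack([A, A.sum(axis=1, keepdims=True)])  # 6x3, rank 2

b = rng.standard_normal(6)

A_pinv = np.linalg.pinv(A)   # Moore-Penrose inverse, computed via SVD
P = A @ A_pinv               # candidate projection onto range(A)

print(np.allclose(P, P.T))               # P is symmetric: True
print(np.allclose(P @ A, A))             # PA = A: True
print(np.allclose(A @ A_pinv @ A, A))    # A A^dagger A = A: True

x = A_pinv @ b               # minimum-norm least-squares solution
print(np.allclose(A @ x, P @ b))         # Ax is the projection of b: True
```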