Why does SVD provide the least squares and least norm solution to $ A x = b $?
First, consider the problem $\Sigma x = b$, where $$ \Sigma = \pmatrix{\sigma_1\\& \ddots\\&&\sigma_r\\ &&&0\\&&&&\ddots\\&&&&&0} $$ Note that $b$ is in the range of $\Sigma$ only if its entries $b_{r+1},\dots,b_n$ are all zero. Furthermore, you should be able to convince yourself (geometrically or otherwise) that the least squares solution of minimum norm must be $$ x = (b_1/\sigma_1,\dots,b_r/\sigma_r,0,\dots,0)^T = \Sigma^+ b $$ From there, note that because $U$ is orthogonal, $\|U\Sigma V^T x - b\| = \|\Sigma (V^T x) - U^T b\|$, so minimizing $\|Ax - b\|$ is the same as solving the least squares problem $$ \Sigma (V^T x ) = U^T b $$ By the above argument, the least squares solution for $(V^T x)$ is given by $V^T x = \Sigma^+ U^T b$. Noting that $\|V^T x\| = \|x\|$, we can conclude that $x = (V \Sigma ^+ U^T)b$ must be the minimum-norm least squares solution (for $x$).
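As a sanity check of the derivation above, here is a small NumPy sketch (the matrix and vector are arbitrary illustrative data) that builds $V\Sigma^+U^T b$ by hand and compares it against `np.linalg.lstsq`, which returns the minimum-norm least squares solution:

```python
import numpy as np

# Illustrative rank-deficient 4x3 system (values are arbitrary).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 3))  # rank 2
b = rng.standard_normal(4)

# Assemble the pseudoinverse solution x = V Sigma^+ U^T b from the SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = s.max() * max(A.shape) * np.finfo(float).eps
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]      # invert only the nonzero sigma_i
x_svd = Vt.T @ (s_inv * (U.T @ b))

# NumPy's lstsq returns the same minimum-norm least squares solution.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_svd, x_lstsq)
```

The residual $Ax - b$ of this solution is orthogonal to the range of $A$, which is exactly the normal-equations characterization of a least squares minimizer.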
I hope you find this explanation sufficient.
The pseudoinverse solution from the SVD is derived in the question "proving standard least square problem with SVD."
Given $\mathbf{A}x=b$, where the data vector $b\notin\mathcal{N}\left( \mathbf{A}^{*} \right)$, the least squares solution exists and is given by $$ x_{LS} = \color{blue}{\mathbf{A}^{\dagger}b} + \color{red}{\left( \mathbf{I}_{n} - \mathbf{A}^{\dagger}\mathbf{A}\right) y}, \quad y\in\mathbb{C}^{n} $$ where blue vectors are in the range space $\color{blue}{\mathcal{R}\left( \mathbf{A}^{*} \right)}$ and red vectors are in the null space $\color{red}{\mathcal{N}\left( \mathbf{A} \right)}.$ The least squares solution minimizes the sum of the squares of the residual errors, and the set of all such solutions is an affine space. That is, $$ \lVert \mathbf{A} x_{LS} (y) - b \rVert_{2}^{2} = r^{2}_{min} $$ for all values of $y$.
What is the vector in this affine space with the smallest length? Because $\color{blue}{\mathcal{R}\left( \mathbf{A}^{*} \right)} \perp \color{red}{\mathcal{N}\left( \mathbf{A} \right)}$, the Pythagorean theorem splits the length of the solution vectors: $$ \lVert x_{LS} \rVert_{2}^{2} = \lVert \color{blue}{\mathbf{A}^{\dagger}b} + \color{red}{\left( \mathbf{I}_{n} - \mathbf{A}^{\dagger}\mathbf{A}\right) y} \rVert_{2}^{2} = \lVert \color{blue}{\mathbf{A}^{\dagger}b} \rVert_{2}^{2} + \underbrace{\lVert \color{red}{\left( \mathbf{I}_{n} - \mathbf{A}^{\dagger}\mathbf{A}\right) y} \rVert_{2}^{2}}_{=\,0 \text{ when } y=\mathbf{0}} $$ The second term vanishes when $y=\mathbf{0}$, so the solution vector of minimum length is $\color{blue}{\mathbf{A}^{\dagger}b}$, the point in the affine space closest to the origin.
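The affine solution family can be checked numerically. The sketch below (arbitrary illustrative data) verifies that every choice of $y$ attains the same minimal residual, while only $y = \mathbf{0}$ gives the shortest solution vector:

```python
import numpy as np

# Illustrative rank-deficient system with a nontrivial null space.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # rank 2, so N(A) is 2-dimensional
b = rng.standard_normal(5)

A_pinv = np.linalg.pinv(A)
x_min = A_pinv @ b                   # the "blue" vector A^+ b
P_null = np.eye(4) - A_pinv @ A      # projector onto N(A), the "red" component

for _ in range(3):
    y = rng.standard_normal(4)
    x = x_min + P_null @ y
    # Every member of the affine space attains the same minimal residual...
    assert np.isclose(np.linalg.norm(A @ x - b), np.linalg.norm(A @ x_min - b))
    # ...but none is shorter than A^+ b, by the Pythagorean split above.
    assert np.linalg.norm(x) >= np.linalg.norm(x_min) - 1e-12
```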
The resource linked below really helped me understand this. The transformation $A$ can be interpreted in 2D as mapping the unit circle to an ellipse. This can be done in a three-step process using the SVD:
- Rotate the unit circle so it can be stretched along the coordinate axes
- Stretch each axis to form the ellipse
- Rotate again to align the ellipse with the output space of $A$
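The three steps above can be sketched directly from the factors of the SVD (the matrix values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# A = U @ diag(s) @ Vt acts on the unit circle as: rotate (Vt),
# stretch (diag(s)), rotate (U). Matrix values are illustrative.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)

theta = np.linspace(0, 2 * np.pi, 100)
circle = np.vstack([np.cos(theta), np.sin(theta)])  # points on the unit circle

step1 = Vt @ circle          # 1. rotate so the stretch directions align with the axes
step2 = np.diag(s) @ step1   # 2. stretch each axis by a singular value -> ellipse
step3 = U @ step2            # 3. rotate into the output space of A

# The composition of the three steps is exactly the action of A.
assert np.allclose(step3, A @ circle)
```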
To solve for $x$, you reverse this process, starting with $b$.
Least squares comes in when step 2 creates an ellipse with a width of zero along some axis. When you run this process in reverse and reach step 2, un-stretching throws away the dimension with zero width. Still, you're left with the closest point to $b$ in the output space of $A$. Continuing through the reversed process gets you to the least squares solution $x'$.
In other words, the transformation $A$ maps the unit circle to a line instead of an ellipse, and you've found the $x$ for which $Ax$ results in the closest point on that line to point $b$.
https://www.cs.cornell.edu/courses/cs3220/2010sp/notes/svd.pdf