How to approximate a solution to a matrix equation?
I want to correct something in both Anton's and las3rjock's answers. I hope that I am merely correcting bad notation.
The way to find $x'$ (using the notation in the question) is, as correctly stated in each of the answers above, to derive the associated problem $A^TAx' = A^Tb$ and solve that via Gaussian elimination (also called Gauss-Jordan elimination; the difference is technical and not important here). This is guaranteed to have a solution.
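(For concreteness, here is a minimal numerical sketch of that recipe, assuming NumPy; the small matrix $A$ and vector $b$ are made up for illustration, and the columns of $A$ are taken to be independent so that $A^TA$ is invertible, a caveat discussed further below.)

```python
import numpy as np

# Hypothetical overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Form the normal equations A^T A x' = A^T b and solve them
# (np.linalg.solve uses Gaussian elimination via an LU factorization).
x_prime = np.linalg.solve(A.T @ A, A.T @ b)
print(x_prime)
```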
The fact that this is the correct solution to the problem relies on the properties of the "closest point" from a point to a subspace. We want the closest point to $b$ on the subspace $\operatorname{Im} A$; call this $b'$. The properties of the "closest point" imply that the difference, $b - b'$, is orthogonal to everything in $\operatorname{Im} A$. It's simple to draw a picture to convince yourself that if $c$ is in $\operatorname{Im} A$ and $b - c$ is not orthogonal to everything in $\operatorname{Im} A$, then it is possible to "nudge" $c$ a little, either away from or towards the origin, to a $c'$ so that $b - c'$ is shorter than $b - c$.
So $b - b'$ is orthogonal to everything in $\operatorname{Im} A$. Since being orthogonal to something is a linear condition, it is sufficient to check that $b - b'$ is orthogonal to a spanning set for $\operatorname{Im} A$, for which we can take the columns of $A$. As we are using the standard inner product, this means that for each column $a$ of $A$ we have $a^T(b - b') = 0$. Putting these together, we obtain the relation $A^T(b - b') = 0$. As $b'$ is in the image of $A$, there is some $x'$ such that $b' = Ax'$, whence we see that $x'$ satisfies $A^Tb - A^TAx' = 0$, and rearranging gives the desired formula. This also guarantees the existence of a solution to this equation.
Using the fact that the closest point is the unique point $b'$ in $\operatorname{Im} A$ such that $b - b'$ is orthogonal to everything in $\operatorname{Im} A$, we can run this argument backwards to see that if $x'$ is a solution of $A^TAx' = A^Tb$ then $Ax'$ is the closest point to $b$ in $\operatorname{Im} A$.
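(A quick numerical check of both directions of the argument; the data are made up, and NumPy's np.linalg.lstsq is used only as an independent reference point.)

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])   # b is not in Im A

# Solve the normal equations A^T A x' = A^T b.
x_prime = np.linalg.solve(A.T @ A, A.T @ b)

# The residual b - Ax' is orthogonal to every column of A ...
print(np.allclose(A.T @ (b - A @ x_prime), 0.0))
# ... and Ax' matches the closest point found by a least-squares routine.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(A @ x_prime, A @ x_ls))
```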
Where the above answers go wrong is in then talking about the matrix $(A^TA)^{-1}A^T$. The problem with this is that $A^TA$ may not be invertible (take $A$ to be the $2 \times 3$ zero matrix). There is a matrix which, when $A^TA$ is invertible, equals $(A^TA)^{-1}A^T$; this is called the pseudo-inverse. Essentially, $A^T$ misses the kernel of $A^TA$, meaning that the composition is always well-defined, but it might not be decomposable as the notation suggests.
This notation may be standard (I don't know), but if it is, then it is bad notation because it suggests a property that may not hold. At the very least, it should always carry a rider making clear that the notation is merely suggestive and not to be taken literally.
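(To make the distinction concrete, here is a small sketch, assuming NumPy, with a rank-deficient matrix made up for illustration: $(A^TA)^{-1}$ does not exist, yet the pseudo-inverse, available as np.linalg.pinv, is perfectly well-defined.)

```python
import numpy as np

# The second column is twice the first, so A^T A is singular
# and (A^T A)^{-1} simply does not exist.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 0.0, 1.0])

# np.linalg.inv(A.T @ A)        # would raise LinAlgError: Singular matrix
x_plus = np.linalg.pinv(A) @ b  # the pseudo-inverse is still well-defined
print(x_plus)
print(np.allclose(A.T @ (b - A @ x_plus), 0.0))  # residual still orthogonal to Im A
```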
Edit: Andrew is absolutely correct that using $(A^TA)^{-1}$ is at best sloppy (in my case just a flat-out error) because $A^TA$ may not be invertible. If $A^TA$ is not invertible, it is because there is a linear dependence among the columns of $A$. In this case, you can remove some of the columns without changing the image until the remaining columns are linearly independent. The explanation I provide below works once you've removed these "extra" columns.
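(One way to carry out that column removal numerically; this helper and the example matrix are made up for illustration, and the greedy rank test is just one of several reasonable choices.)

```python
import numpy as np

def independent_columns(A, tol=1e-10):
    """Greedily keep the columns of A that increase the rank,
    dropping linearly dependent "extra" columns; the image is unchanged."""
    kept = []
    for j in range(A.shape[1]):
        if np.linalg.matrix_rank(A[:, kept + [j]], tol=tol) > len(kept):
            kept.append(j)
    return A[:, kept], kept

A = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 0.0]])   # the second column is twice the first
A_reduced, kept = independent_columns(A)
print(kept)                                                           # [0, 2]
print(np.linalg.matrix_rank(A_reduced) == np.linalg.matrix_rank(A))   # True
```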
Anton's and las3rjock's answers are correct: the $x'$ you're looking for is $(A^TA)^{-1}A^Tb$. But I wanted to add a neat explanation I've heard for why this produces the correct answer.
The image of $A$, i.e. the vectors of the form $Ax$, consists of all linear combinations of the columns of $A$ (the entries of $x$ being the coefficients in the linear combination). So if no solution to the equation $Ax = b$ exists, finding the "best approximate solution" amounts to asking for the projection of $b$ onto the image of $A$.
Claim: The operator $A(A^TA)^{-1}A^T$ is orthogonal projection onto the image of $A$.
Proof: (1) This operator is equal to its square, so it is a projection:
$A(A^TA)^{-1}A^T \, A(A^TA)^{-1}A^T = A(A^TA)^{-1}A^T.$
(2) For any vector $Ax$ in the image of $A$, we have $(A(A^TA)^{-1}A^T)Ax = Ax$, so the image of this projection is equal to the image of $A$.
(3) The projection is orthogonal. Let me denote the inner product by $\langle \cdot, \cdot \rangle$. Suppose $y$ is a vector orthogonal to $Ax$ for all $x$, so $\langle y, Ax \rangle = 0$ for all $x$; then we need to show that $A(A^TA)^{-1}A^Ty = 0$. Write $B = (A^TA)^{-1}A^T$. For any vector $z$, we have (by the defining property of the transpose, together with $(AB)^T = B^TA^T = A(A^TA)^{-1}A^T = AB$)
$\langle A(A^TA)^{-1}A^Ty, z \rangle = \langle y, A(Bz) \rangle = 0.$
Since $A(A^TA)^{-1}A^Ty$ has zero inner product with every $z$, it must be zero.
So if $b$ is not in the image of $A$, the closest vector that is in the image of $A$ is $b' = A(A^TA)^{-1}A^Tb$, which is clearly $A$ applied to $x' = (A^TA)^{-1}A^Tb$.
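(A quick numerical check of the three steps of the proof, assuming NumPy and a made-up full-column-rank $A$; step (3) is checked via symmetry of the projection, which for an idempotent operator is equivalent to orthogonality.)

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T       # needs the columns of A independent

print(np.allclose(P @ P, P))               # (1) P^2 = P, so P is a projection
print(np.allclose(P @ A, A))               # (2) P fixes everything in Im A
print(np.allclose(P, P.T))                 # (3) P is symmetric, i.e. an orthogonal projection
print(np.allclose(P @ b, A @ (np.linalg.inv(A.T @ A) @ (A.T @ b))))  # b' = A x'
```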
Andrew has a correct answer in the pseudoinverse $A^+$, which is characterized by the property that $x = A^+b$ is the shortest vector that solves $A^TAx = A^Tb$ (equivalently, $x$ has no component in the nullspace of $A$). Computationally, it is typically found using a singular value decomposition: if
$\displaystyle A = U \Sigma V^T = \left[ \begin{array}{cc} U_{col} & U_{null} \end{array} \right] \left[ \begin{array}{cc} \Sigma_{pos} & 0 \\ 0 & 0 \end{array} \right] \left[ \begin{array}{cc} V_{row} & V_{null} \end{array} \right]^T,$
then the pseudoinverse is $A^+ = V_{row} \Sigma_{pos}^{-1} U_{col}^T$.
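(A sketch of that computation with NumPy's SVD; the rank-deficient matrix is made up, and the tolerance for deciding which singular values count as zero is a common heuristic rather than anything canonical.)

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])   # rank 1, so A^T A is singular

U, s, Vt = np.linalg.svd(A)                         # A = U * diag(s) * V^T
tol = max(A.shape) * np.finfo(float).eps * s.max()  # threshold for "zero" singular values
r = int(np.sum(s > tol))                            # numerical rank

# Keep only the blocks belonging to the nonzero singular values:
# A^+ = V_row * Sigma_pos^{-1} * U_col^T
A_plus = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

print(np.allclose(A_plus, np.linalg.pinv(A)))       # agrees with NumPy's pinv
```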
I'd like to mention that the pseudoinverse is both rather computationally expensive and unstable in the presence of noise. In particular, if $b$ is given by taking real-world data with limited accuracy, and $A$ is ill-conditioned (e.g., singular), the output of least squares can vary wildly with the error. One common approach to rectify this is Tikhonov regularization, which typically means minimizing $\Vert Ax - b \Vert^2 + \alpha \Vert x \Vert^2$ for some small $\alpha > 0$. This yields a nonsingular problem, which can be solved quickly by Gaussian elimination, and as $\alpha$ approaches zero the solution approaches the pseudoinverse solution. It will not in general yield an exact solution, but there are error-minimizing heuristics (e.g., the discrepancy principle) for choosing $\alpha$ based on knowledge about the size of the noise.
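(A minimal sketch of Tikhonov regularization via the regularized normal equations $(A^TA + \alpha I)x = A^Tb$; the matrix, right-hand side, and $\alpha$ values are made up, and no attempt is made to choose $\alpha$ by the discrepancy principle.)

```python
import numpy as np

def tikhonov(A, b, alpha):
    """Minimize ||Ax - b||^2 + alpha * ||x||^2 by solving the
    (nonsingular for alpha > 0) system (A^T A + alpha I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])     # ill-conditioned (here actually singular)
b = np.array([1.0, 0.0, 1.0])

for alpha in (1e-1, 1e-4, 1e-8):
    print(alpha, tikhonov(A, b, alpha))
print("pinv:", np.linalg.pinv(A) @ b)   # the alpha -> 0 limit
```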
Reference: Strang, Computational Science and Engineering