What is the intuitive relationship between SVD and PCA?

(I assume for the purposes of this answer that the data has been preprocessed to have zero mean.)

Simply put, the PCA viewpoint requires that one compute the eigenvalues and eigenvectors of the covariance matrix, which is the product $\frac{1}{n-1}\mathbf X\mathbf X^\top$, where $\mathbf X$ is the data matrix with one observation per column and $n$ observations in total. Since the covariance matrix is symmetric, it is diagonalizable, and the eigenvectors can be normalized so that they are orthonormal:

$\frac{1}{n-1}\mathbf X\mathbf X^\top=\frac{1}{n-1}\mathbf W\mathbf D\mathbf W^\top,$

where $\mathbf W$ is the matrix of eigenvectors and $\mathbf D$ is the diagonal matrix of eigenvalues.

On the other hand, applying SVD to the data matrix $\mathbf X$ as follows:

$\mathbf X=\mathbf U\mathbf \Sigma\mathbf V^\top$

and attempting to construct the covariance matrix from this decomposition gives $$ \frac{1}{n-1}\mathbf X\mathbf X^\top =\frac{1}{n-1}(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf U\mathbf \Sigma\mathbf V^\top)^\top = \frac{1}{n-1}(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf V\mathbf \Sigma\mathbf U^\top) $$ (using $\mathbf \Sigma^\top=\mathbf \Sigma$, since $\mathbf \Sigma$ is diagonal),

and since $\mathbf V$ is an orthogonal matrix ($\mathbf V^\top \mathbf V=\mathbf I$),

$\frac{1}{n-1}\mathbf X\mathbf X^\top=\frac{1}{n-1}\mathbf U\mathbf \Sigma^2 \mathbf U^\top$

and the correspondence is easily seen: the square roots of the eigenvalues of $\mathbf X\mathbf X^\top$ are the singular values of $\mathbf X$, the eigenvalues of the covariance matrix are $\sigma_i^2/(n-1)$, and the left singular vectors $\mathbf U$ coincide with the eigenvectors $\mathbf W$ (up to sign and ordering), i.e. with the principal directions.
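
To see the correspondence concretely, here is a minimal NumPy sketch (the synthetic data and variable names are mine, purely for illustration); it checks that the eigenvalues of $\frac{1}{n-1}\mathbf X\mathbf X^\top$ equal $\sigma_i^2/(n-1)$ and that the eigenvectors agree with the left singular vectors up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500
# Synthetic zero-mean data: variables as rows, observations as columns.
X = np.diag([3.0, 2.0, 1.0]) @ rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)

# PCA route: eigendecomposition of the covariance matrix (1/(n-1)) X X^T.
C = X @ X.T / (n - 1)
eigvals, W = np.linalg.eigh(C)               # returned in ascending order
eigvals, W = eigvals[::-1], W[:, ::-1]       # reorder to descending

# SVD route: X = U Sigma V^T.
U, s, _ = np.linalg.svd(X, full_matrices=False)

print(np.allclose(eigvals, s**2 / (n - 1)))                  # True
print(np.allclose(np.abs(U.T @ W), np.eye(d), atol=1e-6))    # True (match up to sign)
```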

In fact, using the SVD to perform PCA makes much better sense numerically than forming the covariance matrix to begin with, since forming $\mathbf X\mathbf X^\top$ can cause loss of precision. This is detailed in books on numerical linear algebra, but I'll leave you with an example of a matrix whose SVD can be computed stably, but for which forming $\mathbf X\mathbf X^\top$ is disastrous, the Läuchli matrix:

$\begin{pmatrix}1&1&1\\ \epsilon&0&0\\0&\epsilon&0\\0&0&\epsilon\end{pmatrix}^\top,$

where $\epsilon$ is a tiny number. Once $\epsilon$ is below the square root of machine precision, the computed $1+\epsilon^2$ rounds to $1$, so the explicitly formed $\mathbf X\mathbf X^\top$ becomes numerically singular and the small singular values are lost entirely.
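
Here is a small NumPy sketch of that failure mode (the particular value of $\epsilon$ is my own choice for illustration): the SVD of $\mathbf X$ recovers the tiny singular values, while recovering them from the eigenvalues of the explicitly formed $\mathbf X\mathbf X^\top$ fails completely.

```python
import numpy as np

eps = 1e-8                             # below sqrt(machine precision) for float64
X = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]]).T      # the 3 x 4 Lauchli matrix from above

true_svals = np.array([np.sqrt(3.0 + eps**2), eps, eps])

# Route 1: singular values straight from the SVD of X.
svd_svals = np.linalg.svd(X, compute_uv=False)

# Route 2: singular values recovered from the eigenvalues of X X^T.
gram_eigs = np.linalg.eigvalsh(X @ X.T)[::-1]          # descending
gram_svals = np.sqrt(np.clip(gram_eigs, 0.0, None))    # clip tiny negative round-off

print(np.abs(svd_svals - true_svals) / true_svals)     # small: tiny sigmas survive
print(np.abs(gram_svals - true_svals) / true_svals)    # O(1): tiny sigmas destroyed
```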


A Tutorial on Principal Component Analysis by Jonathon Shlens is a good introduction to PCA and its relation to the SVD; see in particular Section VI: A More General Solution Using SVD.


The question boils down to whether you want to subtract the means and divide by the standard deviations first. The same question arises in the context of linear and logistic regression, so I'll reason by analogy.

In many problems our features are positive values, such as counts of words or pixel intensities. Typically a higher count or a higher pixel intensity means that a feature is more useful for classification or regression. If you subtract the means, then features whose original value was zero are forced to take a negative value of large magnitude. This makes feature values that are unimportant to the classification problem (the previously zero-valued ones) as influential as the most important feature values (those with high counts or pixel intensities).
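
As a toy illustration (the numbers below are hypothetical word counts I made up), centering turns the zero entries into values whose magnitudes are comparable to the centered values of the informative feature:

```python
import numpy as np

# Hypothetical word counts: rows are documents, columns are words.
counts = np.array([[9.0, 0.0, 0.0],
                   [8.0, 1.0, 0.0],
                   [7.0, 0.0, 2.0]])

centered = counts - counts.mean(axis=0)
print(centered)
# [[ 1.         -0.33333333 -0.66666667]
#  [ 0.          0.66666667 -0.66666667]
#  [-1.         -0.33333333  1.33333333]]
# Entries that were 0 (word absent) are now nonzero, with magnitudes
# comparable to the centered counts of the frequent first word.
```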

The same reasoning holds for PCA. If your features are least sensitive (least informative) around the mean of the distribution, then it makes sense to subtract the mean. If the features are most sensitive at their high values, then subtracting the mean does not make sense.

SVD does not subtract the means, but its first component often ends up pointing roughly toward the mean of all the data points, effectively projecting the data onto that direction first. In this way the SVD first takes care of the global structure.
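
A rough sketch of this last point (my own construction, assuming observations as rows here, unlike the column convention earlier on this page): for non-negative data far from the origin, the leading right singular vector of the uncentered matrix is nearly parallel to the mean vector, so the first component mostly encodes the global offset.

```python
import numpy as np

rng = np.random.default_rng(1)
# Non-negative "count-like" data: 200 observations (rows) x 5 features (columns).
A = rng.poisson(lam=20.0, size=(200, 5)).astype(float)

mean = A.mean(axis=0)
_, _, Vt = np.linalg.svd(A, full_matrices=False)   # SVD without any centering
first_dir = Vt[0]                                  # leading right singular vector

# Cosine between the first singular direction and the mean of the data points.
print(abs(first_dir @ mean) / np.linalg.norm(mean))   # close to 1
```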