Obtain eigen values and vectors from sklearn PCA
I used the sklearn PCA function. The return parameters 'components_' is eigen vectors and 'explained_variance_' is eigen values. Below is my test code.
from sklearn.decomposition import PCA
import numpy as np
def main():
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7], [2, 1.6], [1, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(data)
pca = PCA()
pca.fit(data)
print(pca.components_)
print(pca.explained_variance_)
if __name__ == "__main__":
main()
When you say "eigenvalues", do you mean the "singular values" for PCA? Eigenvalues are only possible when the matrix PCA applied on are square matrix.
If you are trying to use "eigenvalues" to determine the proper dimension needed for PCA, you should actually use singular values. You can just use pca.singular_values_ to get the singular values.
Your implementation
You are computing the eigenvectors of the correlation matrix, that is the covariance matrix of the normalized variables.data/=np.std(data, axis=0)
is not part of the classic PCA, we only center the variables.
So the sklearn PCA does not feature scale the data beforehand.
Apart from that you are on the right track, if we abstract the fact that the code you provided did not run ;).
You only got confused with the row/column layouts. Honestly I think it's much easier to start with X = data.T
and work only with X from there on. I added your code 'fixed' at the end of the post.
Getting the eigenvalues
You already noted that you can get the eigenvectors using clf.components_
.
So you have the principal components. They are eigenvectors of the covariance matrix ðᵀð.
A way to retrieve the eigenvalues from there is to apply this matrix to each principal components and project the results onto the component.
Let v_1 be the first principal component and lambda_1 the associated eigenvalue. We have:
and thus:
since . (x, y) the scalar product of vectors x and y.
Back in Python you can do:
n_samples = X.shape[0]
# We center the data and compute the sample covariance matrix.
X -= np.mean(X, axis=0)
cov_matrix = np.dot(X.T, X) / n_samples
for eigenvector in pca.components_:
print(np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)))
And you get the eigenvalue associated with the eigenvector. Well, in my tests it turned out not to work with the couple last eigenvalues but I'd attribute that to my absence of skills in numerical stability.
Now that's not the best way to get the eigenvalues but it's nice to know where they come from.
The eigenvalues represent the variance in the direction of the eigenvector. So you can get them through the pca.explained_variance_
attribute:
eigenvalues = pca.explained_variance_
Here is a reproducible example that prints the eigenvalues you get with each method:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000)
n_samples = X.shape[0]
pca = PCA()
X_transformed = pca.fit_transform(X)
# We center the data and compute the sample covariance matrix.
X_centered = X - np.mean(X, axis=0)
cov_matrix = np.dot(X_centered.T, X_centered) / n_samples
eigenvalues = pca.explained_variance_
for eigenvalue, eigenvector in zip(eigenvalues, pca.components_):
print(np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)))
print(eigenvalue)
Your original code, fixed
If you run it you'll see the values are consistent. They're not exactly equal because numpy and scikit-learn are not using the same algorithm here.
The main thing was that you were using correlation matrix instead of covariance, as mentioned above. Also you were getting the transposed eigenvectors from numpy which made it very confusing.
import numpy as np
from scipy.stats.mstats import zscore
from sklearn.decomposition import PCA
def pca_code(data):
#raw_implementation
var_per=.98
data-=np.mean(data, axis=0)
# data/=np.std(data, axis=0)
cov_mat=np.cov(data, rowvar=False)
evals, evecs = np.linalg.eigh(cov_mat)
idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]
variance_retained=np.cumsum(evals)/np.sum(evals)
index=np.argmax(variance_retained>=var_per)
evecs = evecs[:,:index+1]
reduced_data=np.dot(evecs.T, data.T).T
print("evals", evals)
print("_"*30)
print(evecs.T[1, :])
print("_"*30)
#using scipy package
clf=PCA(var_per)
X_train=data
X_train=clf.fit_transform(X_train)
print(clf.explained_variance_)
print("_"*30)
print(clf.components_[1,:])
print("__"*30)