Python scikit learn pca.explained_variance_ratio_ cutoff

Although this question is more than two years old, I want to provide an update. I wanted to do the same thing, and it looks like sklearn now provides this feature out of the box.

As stated in the docs:

if 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components

So the code required is now:

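from sklearn.decomposition import PCA
# my_matrix is the (n_samples, n_features) data array defined earlier
my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)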
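To see what the fractional n_components actually selects, here is a minimal sketch on made-up random data. The array X, the 0.90 threshold, and the variable names are assumptions for illustration only; n_components_ and explained_variance_ratio_ are the fitted attributes sklearn exposes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative toy data (an assumption, not from the original question).
rng = np.random.RandomState(0)
X = rng.randn(50, 8)

# Ask PCA to keep enough components to explain 90% of the variance.
model = PCA(n_components=0.90, svd_solver="full")
reduced = model.fit_transform(X)

kept = model.n_components_                      # number of components chosen
total = model.explained_variance_ratio_.sum()   # variance they explain together
print("components kept:", kept)
print("total explained variance:", round(total, 4))
```

After fitting, reduced has one column per retained component, and the summed ratio should be at least the requested fraction.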

This worked for me with even less typing in the PCA call: passing the fraction as the first argument is enough. The rest is added for convenience; only 'data' needs to be defined beforehand.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features, then keep enough components for 80% of the variance.
st = StandardScaler().fit_transform(data)
pca = PCA(0.80)
pc = pca.fit_transform(st)  # retain the projected components in an object

# pca.explained_variance_ratio_ holds the per-component ratios
print("Components = ", pca.n_components_, ";\nTotal explained variance = ",
      round(pca.explained_variance_ratio_.sum(), 5))

Yes, you are nearly right. The pca.explained_variance_ratio_ attribute is a vector holding the fraction of variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the fraction of variance explained solely by the (i+1)-th dimension.

You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] is the cumulative fraction of variance explained by the first i+1 dimensions.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
my_matrix = np.random.randn(20, 5)

# Keep all 5 components so the full variance breakdown is visible.
my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)

print(my_model.explained_variance_)
print(my_model.explained_variance_ratio_)
print(my_model.explained_variance_ratio_.cumsum())

[ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
[ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.
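If you would rather compute the cutoff than read it off the printout, a small sketch along these lines should work. It reuses the ratios printed above; the 0.90 threshold and the variable names are assumptions for illustration.

```python
import numpy as np

# Explained variance ratios copied from the toy output above.
ratios = np.array([0.32047581, 0.27502207, 0.20629036, 0.13118776, 0.067024])
cumulative = ratios.cumsum()

threshold = 0.90  # hypothetical cutoff: retain at least 90% of the variance

# searchsorted returns the first index whose cumulative value is >= threshold;
# adding 1 converts that zero-based index into a component count.
k = int(np.searchsorted(cumulative, threshold)) + 1

print("k =", k)
print("variance retained =", round(float(cumulative[k - 1]), 6))
```

With a fitted model, the same two lines apply to my_model.explained_variance_ratio_ instead of the hard-coded array.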
