Exhaustive feature selection in scikit-learn?

No, best subset selection is not implemented. The easiest way to do it is to write it yourself. This should get you started:

from itertools import chain, combinations

import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in old versions

def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    subsets = chain.from_iterable(combinations(range(n_features), k + 1)
                                  for k in range(n_features))

    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score

This performs k-fold cross-validation inside the loop, so it will fit k · 2^p estimators when given data with p features (one fit per fold for each of the 2^p − 1 non-empty feature subsets).
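
A quick way to try it out (a sketch only, assuming the best_subset_cv function above is in scope; the iris data and LogisticRegression are used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Hypothetical usage of best_subset_cv defined above.
X, y = load_iris(return_X_y=True)
subset, score = best_subset_cv(LogisticRegression(max_iter=1000), X, y, cv=3)
print("best subset:", subset, "mean CV accuracy:", round(score, 3))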


You might want to take a look at MLxtend's Exhaustive Feature Selector. It is obviously not built into scikit-learn (yet?), but it does work with scikit-learn classifier and regressor objects.
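
A minimal sketch of typical usage, assuming MLxtend is installed (parameter and attribute names follow MLxtend's ExhaustiveFeatureSelector documentation; the iris data and LogisticRegression are just examples):

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Try every subset of 1 to 4 features and score each with 5-fold CV.
efs = EFS(LogisticRegression(max_iter=1000),
          min_features=1,
          max_features=4,
          scoring='accuracy',
          cv=5)
efs = efs.fit(X, y)

print("best feature indices:", efs.best_idx_)
print("best CV score:", efs.best_score_)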


Combining Fred Foo's answer with the comments by nopper, ihadanny, and jimijazz, the following code gets the same results as the R function regsubsets() (from the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) of the book "An Introduction to Statistical Learning with Applications in R".

from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in old versions

def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
    estimator must have fit and score methods.
    X must be a DataFrame.'''

    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1) 
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k: # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores
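
For illustration, a small synthetic example (a sketch only: the column names and coefficients are made up, and LinearRegression stands in for any estimator with fit and score methods):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(100, 5), columns=list('abcde'))
y = 2 * X['a'] - X['c'] + 0.1 * rng.randn(100)  # only 'a' and 'c' are informative

subset, score, best_per_size, scores = best_subset(LinearRegression(), X, y, max_size=5, cv=5)
print("selected columns:", list(X.columns[list(subset)]))
print("mean CV R^2:", round(score, 3))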

See notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection
