Is there a way to perform grid search hyper-parameter optimization on a One-Class SVM?
I ran into this same problem and found this question while searching for a solution. I ended up finding a solution that uses GridSearchCV and am leaving this answer for anyone else who searches and finds this question.
The cv parameter of the GridSearchCV class can take as its input an iterable yielding (train, test) splits as arrays of indices. You can generate splits that use only data from the positive class in the training folds, and the remaining positive-class data plus all data in the negative class in the test folds.

You can use sklearn.model_selection.KFold to make the splits:
from sklearn.model_selection import KFold
Suppose Xpos is an n×p numpy array of data for the positive class for the OneClassSVM, and Xneg is an m×p array of data for known anomalous examples. You can first generate splits for Xpos using
splits = KFold(n_splits=5).split(Xpos)
This will construct a generator of tuples of the form (train, test), where train is a numpy array of ints containing the indices for the examples in a training fold and test is a numpy array containing the indices for the examples in a test fold.
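As a quick illustration (the toy data and sizes here are my own choice), the splitter yields index arrays of the expected fold sizes:

```python
import numpy as np
from sklearn.model_selection import KFold

# toy positive-class data: 10 examples with p = 2 features
Xpos = np.arange(20).reshape(10, 2)

# with 5 folds over 10 examples, each split has 8 train and 2 test indices
fold_sizes = [(len(train), len(test))
              for train, test in KFold(n_splits=5).split(Xpos)]
print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```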
You can then combine Xpos and Xneg into a single dataset using
X = np.concatenate([Xpos, Xneg], axis=0)
The OneClassSVM will predict 1.0 for examples it thinks belong to the positive class and -1.0 for examples it thinks are anomalous. We can make labels for our data using
y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])
We can then make a new generator of (train, test) splits with the indices for the anomalous examples included in the test folds:
n, m = len(Xpos), len(Xneg)
splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in splits)
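As a sanity check (toy sizes are my own assumption), every combined test fold should contain all the negative-class indices n through n + m - 1:

```python
import numpy as np
from sklearn.model_selection import KFold

# toy sizes: n positive examples, m anomalous examples
n, m = 10, 4
Xpos = np.zeros((n, 2))

splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in KFold(n_splits=5).split(Xpos))

# each combined test fold must include every negative index n .. n+m-1
all_negatives_in_test = all(set(range(n, n + m)) <= set(test)
                            for _, test in splits)
```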
You can then pass these splits to GridSearchCV, using the data X, y and whatever scoring method and other parameters you wish:
grid_search = GridSearchCV(estimator, param_grid, cv=splits, scoring=...)
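Putting the pieces together, here is an end-to-end sketch on synthetic data; the data shapes, the parameter values, and the plain accuracy scoring are my own assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
Xpos = rng.normal(0, 1, size=(100, 2))   # positive (normal) examples
Xneg = rng.normal(5, 1, size=(20, 2))    # known anomalous examples

X = np.concatenate([Xpos, Xneg], axis=0)
y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])

# training folds draw only from the positive class; every test fold also
# receives all of the anomalous examples
n, m = len(Xpos), len(Xneg)
splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in KFold(n_splits=5).split(Xpos))

param_grid = {'gamma': [0.01, 0.1, 1.0], 'nu': [0.05, 0.1, 0.5]}
grid_search = GridSearchCV(OneClassSVM(), param_grid, cv=splits,
                           scoring='accuracy')
grid_search.fit(X, y)
print(grid_search.best_params_)
```

OneClassSVM ignores the labels during fitting, but GridSearchCV still needs y so the scorer can compare predictions (±1.0) against the true labels in each test fold.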
Edit: I hadn’t noticed that this approach was suggested in the comments of the other answer by Vivek Kumar, and that the OP had rejected it because they didn’t believe it would work with their method of choosing the best parameters. I still prefer the approach I’ve described because GridSearchCV will automatically handle multiprocessing and provides exception handling and informative warning and error messages.
It is also flexible in the choice of scoring method. You can use multiple scoring methods by passing a dictionary mapping strings to scoring callables and even define custom scoring callables. This is described in the Scikit-learn documentation here. A bespoke method of choosing the best parameters could likely be implemented with a custom scoring function. All of the metrics used by the OP could be included using the dictionary approach described in the documentation.
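For example, here is a minimal sketch of the dictionary approach (the metric names and scorers are my own choice); note that with multiple scorers, GridSearchCV needs refit= to name the metric used to pick best_params_:

```python
import numpy as np
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
Xpos = rng.normal(0, 1, size=(60, 2))
Xneg = rng.normal(6, 1, size=(15, 2))
X = np.concatenate([Xpos, Xneg], axis=0)
y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])

n, m = len(Xpos), len(Xneg)
splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in KFold(n_splits=3).split(Xpos))

# score how well the model flags anomalies (label -1.0)
scoring = {'precision': make_scorer(precision_score, pos_label=-1.0),
           'recall': make_scorer(recall_score, pos_label=-1.0)}
gs = GridSearchCV(OneClassSVM(), {'nu': [0.1, 0.5]}, cv=splits,
                  scoring=scoring, refit='recall')
gs.fit(X, y)
```

After fitting, cv_results_ contains per-metric columns such as mean_test_precision and mean_test_recall, so all the metrics can be inspected even though only one drives the refit.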
You can find a real world example here. I'll make a note to change the link when this gets merged into master.
Yes, there is a way to search over hyper-parameters without performing cross-validation on the input data. The method is called ParameterGrid() and lives in sklearn.model_selection. Here is the link to the official documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html
Your case might look like the following:
grid = {'gamma': np.logspace(-9, 3, 13),
        'nu': np.linspace(0.01, 0.99, 99)}
To see every combination in the grid you can type list(ParameterGrid(grid)). We can also check its length via len(list(ParameterGrid(grid))), which gives 1287 in total, and thus 1287 models to fit on the training data.
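You can verify that count yourself: a grid with 13 gamma values and 99 nu values has 13 × 99 = 1287 combinations.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = {'gamma': np.logspace(-9, 3, 13),
        'nu': np.linspace(0.01, 0.99, 99)}
n_candidates = len(list(ParameterGrid(grid)))
print(n_candidates)  # 1287
```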
To use the method you need a for loop. Assuming you have a clf variable holding your unfitted one-class SVM imported from sklearn.svm, the loop will look something like this:
for z in ParameterGrid(grid):
    clf.set_params(**z)
    clf.fit(X_train, y_train)
    clf.predict(X_test)
    ...
I hope that suffices. Do not forget that the names in grid must match the parameters of the one-class SVM. To get the names of those parameters you can type clf.get_params().keys(), and there you will see 'gamma' and 'nu'.
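For completeness, here is a runnable version of that loop on synthetic data (the dataset and the plain-accuracy scoring are my own assumptions), keeping the parameter set that scores best on a held-out test set:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(80, 2))   # positive-class training data
X_test = np.concatenate([rng.normal(0, 1, size=(20, 2)),   # normal
                         rng.normal(6, 1, size=(5, 2))],   # anomalous
                        axis=0)
y_test = np.concatenate([np.repeat(1.0, 20), np.repeat(-1.0, 5)])

grid = {'gamma': np.logspace(-3, 1, 5), 'nu': [0.05, 0.1, 0.5]}
clf = OneClassSVM()
best_score, best_params = -np.inf, None
for z in ParameterGrid(grid):
    clf.set_params(**z)
    clf.fit(X_train)                       # OneClassSVM ignores labels
    score = np.mean(clf.predict(X_test) == y_test)  # held-out accuracy
    if score > best_score:
        best_score, best_params = score, z
print(best_params, best_score)
```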