Predict: How to customize model construction? (e.g. k-fold cross-validation, mtry, number of features used, ROC curve)

Question 3

How can we obtain a ROC curve?

So in general, knowing that Method -> "RandomForest" uses Breiman-Cutler ensembles, how can we tweak Predict?

For basic ROC usage, see the blog post "Basic example of using ROC with Linear regression".

For ROC with classifier ensembles (using Classify, not Predict), see the blog post "ROC for classifier ensembles, bootstrapping, damaging, and interpolation".

Here is an example of a ROC plot for an ensemble of classifiers over the Adult dataset:

"ROC-for-AdultDataset-EnsembleClassifiers"

References

Here are the packages used for ROC and classifier ensembles in the blog posts above:

[1] Anton Antonov, Receiver operating characteristic functions Mathematica package, (2016), source code at MathematicaForPrediction at GitHub, package ROCFunctions.m.

[2] Anton Antonov, Classifier ensembles functions Mathematica package, (2016), source code at MathematicaForPrediction at GitHub, package ClassifierEnsembles.m.


Answer to question 1

You can find the possible suboptions via Options[p], where p = Predict[data -> output, Method -> method]:

Method->{"RandomForest", "TreeNumber" -> 100, "LeafSize" -> 10, "VariableSampleSize" -> 10}

Method->{"NeuralNetwork", "HiddenLayers"-> {{4, "RectifiedLinear"}, {3, "Tanh"}}}

(* Possible layers are: {"LogisticSigmoid", "RectifiedLinear", "Tanh", "SoftRectifiedLinear", "Linear"} *)

Method -> {"NearestNeighbors", "NeighborsNumber" -> 6}

Method -> {"LinearRegression", "L1Regularization" -> 0., "L2Regularization" -> 0.}

If you want to use only 85 of the features, you can select just those columns:

p = Predict[data[[;;,features]]->output]
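A tiny illustration of that indexing, with made-up data and an arbitrary index list (both are assumptions, just to show the pattern):

data = RandomReal[1, {100, 10}];      (* 100 examples, 10 numeric features *)
output = data . RandomReal[1, 10];    (* synthetic response *)
features = {1, 3, 4, 7, 9};           (* column indices to keep *)
p = Predict[data[[;; , features]] -> output]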

Answer to question 2

There is a function KFoldSplit in the internal MachineLearning` context, so the answer is probably 'yes': Mathematica does cross-validation internally.

?MachineLearning`PackageScope`*

KFoldSplit[data, nsplit] outputs nsplit pairs {trainingset, validationset} according to the K-Fold cross validation procedure.

KFoldSplit[data, nsplit, i] outputs the i-th split.

data can be X or {X, Y} where X is the list of feature vectors, and Y is the response vector.
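Since KFoldSplit lives in an undocumented internal context and may change between versions, here is a sketch of doing the same cross-validation by hand with documented functions only. The fold construction, the toy data, and the use of the "StandardDeviation" measurement are my choices, not something Predict or KFoldSplit does for you:

k = 5; n = 200;
X = RandomReal[1, {n, 3}];
Y = X . {2., -1., 0.5} + RandomReal[{-0.05, 0.05}, n];

folds = Partition[RandomSample[Range[n]], n/k];   (* k disjoint index sets *)

errors = Table[
   With[{test = folds[[i]], train = Flatten[Delete[folds, i]]},
    PredictorMeasurements[
     Predict[X[[train]] -> Y[[train]], Method -> "RandomForest"],
     Thread[X[[test]] -> Y[[test]]],
     "StandardDeviation"]],
   {i, k}];

Mean[errors]   (* average hold-out residual standard deviation over the folds *)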