Classifier training on a large data set

After posting this answer, OP clarified the question with:

To be more specific, I am not looking for the solution which will improve my result but for the one which will allow me to load a big training set.

This answer is about getting good classification results over a large dataset, not about loading large datasets into Classify.

Investigation steps

A list of the steps I took follows.

  1. Download the data and import it.
  2. Basic data analysis.
    • The data contains duplicate records.
    • They were removed with Union.
  3. Flatten the 2D arrays.
  4. Further data analysis.
    For example, applying SVD produced 3-5 significant singular vectors, though that did not prove useful for the classification.
  5. Take a sample of 10^5 or 3*10^5 records and run classification experiments.
    Relatively good results were obtained.
  6. Use larger samples, e.g. 1.5*10^6 records.
  7. It turned out that training on a sample of ~5*10^4 records and classifying the rest produced fairly good results.
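The SVD check in step 4 can be sketched as follows. This is a hypothetical illustration, assuming trainingSet2 holds the de-duplicated records with flattened 20-element feature vectors, as constructed in the "Code" section below:

```mathematica
(* Hypothetical sketch: estimate the number of significant singular vectors
   on a random sample of the flattened feature vectors. *)
SeedRandom[1425];
sampleMat = N[RandomSample[trainingSet2[[All, 1]], 10^4]];
svals = SingularValueList[sampleMat];
ListLogPlot[svals, PlotRange -> All,
 AxesLabel -> {"index", "singular value"}]
(* A sharp drop after the first few values suggests 3-5 significant vectors. *)
```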

For the experiments I used the package "MonadicContextualClassification.m". See the article "A monad for classification workflows".

The important class label is "0" ("No" below), because only ~13% of the records have it. Using the ROC curves and selecting the threshold 0.175, the obtained classifier produced the following performance measures:

(* <|FalsePositiveRate-><|No->0.138243,Yes->0.124086|>,
     Recall-><|No->0.875914,Yes->0.861757|>,
     Specificity-><|No->0.861757,Yes->0.875914|>,
     MatthewsCorrelationCoefficient-><|No->0.737745,Yes->0.737745|>,
     Accuracy->0.863442|> *)
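The threshold-based classification used here amounts to predicting "No" whenever the classifier's probability for "No" exceeds 0.175, instead of the default highest-probability rule. A minimal sketch outside the monad, where cf is assumed to be the classifier extracted below with ClConTakeClassifier:

```mathematica
(* Hypothetical sketch: threshold classification with a plain
   ClassifierFunction over the labels "Yes"/"No". *)
classifyWithThreshold[cf_ClassifierFunction, record_] :=
  If[cf[record, "Probabilities"]["No"] > 0.175, "No", "Yes"]
```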

I assume the overall accuracy and per-class recall shown above are satisfactory. If not, further experiments can be done with classifier ensembles, outlier removal, etc.
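The classifier-ensemble idea can be sketched with plain Classify as majority voting over several methods. This is a hypothetical simplification (the imported ClassifierEnsembles.m provides a fuller implementation); trainingSet2Small is from the code below:

```mathematica
(* Hypothetical sketch: a simple majority-vote ensemble. *)
methods = {"GradientBoostedTrees", "RandomForest", "NearestNeighbors"};
classifiers = Classify[trainingSet2Small, Method -> #] & /@ methods;
(* Predict the most common label among the ensemble members. *)
ensembleClassify[record_] := First[Commonest[Through[classifiers[record]]]];
```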

Code

AbsoluteTiming[
 trainingSet = Import["~/Downloads/trainingSet.mx"];
]
Length[trainingSet]

(* {3.36351, Null}

   3467240 *)

(* Remove duplicates. *)
AbsoluteTiming[
 trainingSet2 = Union[trainingSet];
]
Length[trainingSet2]

(* {16.4582, Null}
    3312223 *)

(* Flatten 2D arrays. *) 
trainingSet2[[All, 1]] = Flatten /@ trainingSet2[[All, 1]];
Dimensions[trainingSet2[[All, 1]]]

(* {3312223, 20} *)

(* See the class labels representation. *)
Tally[trainingSet2[[All, 2]]]

(* {{1., 2918036}, {0., 394187}} *)

(* Using categorical labels. *)
trainingSet2[[All, 2]] = trainingSet2[[All, 2]] /. {1. -> "Yes", 0. -> "No"};
Tally[trainingSet2[[All, 2]]]

(* {{"Yes", 2918036}, {"No", 394187}} *)

(* Taking relatively smaller sample. *)
SeedRandom[3439]
trainingSet2Small = RandomSample[trainingSet2, 3*10^5];

(* Load the ClCon package. *)
Import["https://raw.githubusercontent.com/antononcube/\
MathematicaForPrediction/master/MonadicProgramming/\
MonadicContextualClassification.m"]

(* Importing from GitHub: StateMonadCodeGenerator.m

   Importing from GitHub: ClassifierEnsembles.m

   Importing from GitHub: ROCFunctions.m

   Importing from GitHub: VariableImportanceByClassifiers.m *)

(* Application of the software monad ClCon. *)
AbsoluteTiming[
 clObj =
   ClConUnit[trainingSet2Small]⟹
    ClConSplitData[0.2, 0.2]⟹
    ClConTrainClassifier[{"GradientBoostedTrees"}]⟹
    ClConClassifierMeasurements⟹
    ClConEchoValue⟹
    ClConROCPlot["FPR", "TPR", ImageSize -> 350]⟹
    ClConROCListLinePlot[{"TPR", "FPR", "SPC", "MCC", "Accuracy"}, ImageSize -> 350];
 ]

(Output: ROC plot and ROC list line plot for the trained classifier.)

 (* {70.5563, Null} *)


(* This gets the classifier: *)
(clObj⟹ClConTakeClassifier)[[1]]

(* This verifies that the obtained classifier produces good results with the selected threshold, 0.175. *)
AbsoluteTiming[
 ClConUnit[]⟹
   ClConSetTestData[trainingSet2]⟹
   ClConSetClassifier[clObj⟹ClConTakeClassifier]⟹
   ClConClassifierMeasurementsByThreshold[{ "FalsePositiveRate", "Recall", "Specificity", "MatthewsCorrelationCoefficient", "Accuracy"}, "No" -> 0.175]⟹
   ClConEchoValue;
 ]

(* value: <|FalsePositiveRate-><|No->0.138243,Yes->0.124086|>,
            Recall-><|No->0.875914,Yes->0.861757|>,
            Specificity-><|No->0.861757,Yes->0.875914|>,
            MatthewsCorrelationCoefficient-><|No->0.737745,Yes->0.737745|>,
            Accuracy->0.863442|> *)

(* {229.404, Null} *)

The Problem

We have a training set of ~2.7 million elements with which we wish to train a random forest classifier. However, Mathematica errors out before the classifier is finished training.

What Went Wrong

Mathematica can have trouble with large datasets, but that isn't the problem here. Let's investigate the data:

 AbsoluteTiming[trainingSet = Import["~/Downloads/trainingSet.mx"];]
 {Length[trainingSet], 
     Length[trainingSet] - Length[DeleteDuplicates[trainingSet]]}
{16.9372, Null}
{2748500, 14522}

One problem is that the training set has duplicate elements. Since we are training a binary classifier, we want the categories to be equally represented. Let's check the distribution.

Count[Union[trainingSet], _ -> 1.0]
2160225

So of the 2733978 unique elements, ~80% are in category 1. This is not a good training set, because a classifier that always predicts category 1 already achieves an accuracy of ~80%.

Training with adjusted data

We split the dataset into a training set and a test set, each of which contains equal parts category-0 and category-1 data. Note that I am using 100% of the category-0 data to do this.

trainingSet = DeleteDuplicates[Import["~/Downloads/trainingSet.mx"]];
SeedRandom[42];
trainingSet = 
  RandomSample[
   Join[RandomSample[Select[trainingSet, Last[#] == 1.0 &], 573753], 
    Select[trainingSet, Last[#] == 0.0 &]]];
{trainingSet, testSet} = {Take[trainingSet, 573753], 
   Take[trainingSet, -573753]};

Train and save the result to disk just in case. I'm using the notebook frontend, so I use TrainingProgressReporting -> "Panel".

rfModel = 
   Classify[trainingSet, 
   Method -> {"RandomForest", "LeafSize" -> 6, "TreeNumber" -> 850}, 
   TrainingProgressReporting -> "Panel"];
DumpSave["~/Downloads/rfModel.mx", rfModel];

Evaluate the performance

It's always best to create a ClassifierMeasurementsObject so you don't have to run the test again.

cmo = ClassifierMeasurements[rfModel, testSet];
cmo["Report"]

Transpose[{
    {"MatthewsCorrelationCoefficient", "CohenKappa", "EvaluationTime"}, 
    cmo /@ {"MatthewsCorrelationCoefficient", "CohenKappa", "EvaluationTime"}
}] // TableForm
MatthewsCorrelationCoefficient  <|0.->0.917584,1.->0.917584|>
CohenKappa                      0.916081
EvaluationTime                  0.0371

If you set it loose on your entire 2,733,978-element dataset, be sure to use something like the Matthews correlation coefficient to account for the fact that the categories have wildly different numbers of elements.
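As a cross-check, the Matthews correlation coefficient can be computed directly from confusion-matrix counts. A sketch, assuming cmo["ConfusionMatrix"] returns counts with rows as actual classes {0., 1.} and columns as predicted classes, with class 1. playing the role of "positive":

```mathematica
(* Hedged sketch: MCC from raw confusion-matrix counts.
   Assumed layout: {{tn, fp}, {fn, tp}} with class 1. as "positive". *)
{{tn, fp}, {fn, tp}} = cmo["ConfusionMatrix"];
mcc = (tp tn - fp fn)/Sqrt[N[(tp + fp) (tp + fn) (tn + fp) (tn + fn)]]
```

For a binary problem MCC is symmetric under swapping the two classes, which is why cmo reports the same value for both labels above.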