Tuesday, December 10, 2013

Part 3: Random Forests and Model Selection Considerations

I want to wrap up this series on the breast cancer data set and move on to other topics.  Here I will add the random forest technique and evaluate it alongside the two earlier techniques, the conditional inference tree and bootstrap aggregation (bagging).  I was going to include several other techniques here, but have decided to move on to other data sets and to explore several of the machine learning R packages to a greater extent.  In my investigation of machine learning in R, I have found a paucity of easy-to-follow tutorials.  I shall endeavor to close that gap.

I'll also spend some time here going over the ROC curve and how to use it for model selection.

Random forests draw bootstrap samples with replacement, as we did last time with bagging, so each tree leaves out roughly a third of the observations (the "out-of-bag" cases).  In addition, they consider only a random subset of the variables at each tree node; the printed output below shows 3 variables tried at each split.  The following website from Cal Berkeley provides some excellent background information on the topic.
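To make the "roughly a third" figure concrete, here is a minimal sketch in base R that simulates bootstrap samples and measures the share of rows each one leaves out.  The training-set size of 472 is taken from the OOB confusion matrix in this post; the seed, the number of replications and the count of 9 predictors are my assumptions for illustration.

```r
# Sketch: what fraction of rows does one bootstrap sample leave out?
set.seed(42)   # arbitrary seed, for reproducibility only
n <- 472       # training rows implied by the OOB confusion matrix (292 + 9 + 7 + 164)

# Draw many bootstrap samples and record the share of rows never drawn
oob_fraction <- replicate(2000, {
  drawn <- sample(n, size = n, replace = TRUE)
  1 - length(unique(drawn)) / n
})
mean_oob   <- mean(oob_fraction)   # simulated average share left out
theory_oob <- (1 - 1 / n)^n        # exact expectation; tends to exp(-1) as n grows

# randomForest's default number of variables tried per split for classification
p <- 9                             # assumed number of predictors in this data set
default_mtry <- floor(sqrt(p))     # agrees with the "3" in the printed output

print(c(simulated = mean_oob, theory = theory_oob))
print(default_mtry)
```

The rows a tree never sees are what randomForest uses for its out-of-bag (OOB) error estimate, which is why no separate validation set is needed to get a first read on performance.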

We can use the party package from earlier, but let's try randomForest.

> # randomforest
> library(randomForest)
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> forest_train = randomForest(class ~ ., data = train)
> print(forest_train) #notice the number of trees, the number of variables tried at each split and the confusion matrix

 randomForest(formula = class ~ ., data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 3.39%
Confusion matrix:
          benign malignant class.error
benign       292         9  0.02990033
malignant      7       164  0.04093567

> testforest = predict(forest_train, newdata=test)
> table(testforest, test$class) #confusion matrix for test set
testforest  benign malignant
  benign       140         1
  malignant      3        67
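As a quick check on the test-set table above, the sketch below rebuilds it by hand and computes the overall accuracy, then compares the test error with the OOB estimate.  The counts are copied from the printed output in this post; the object names are my own.

```r
# Test-set confusion matrix from above (rows = predicted, columns = actual)
conf <- matrix(c(140, 3, 1, 67), nrow = 2,
               dimnames = list(predicted = c("benign", "malignant"),
                               actual    = c("benign", "malignant")))

accuracy   <- sum(diag(conf)) / sum(conf)   # 207 / 211
test_error <- 1 - accuracy                  # 4 / 211

# OOB error recomputed from the training confusion matrix: (9 + 7) / 472
oob_error <- (9 + 7) / (292 + 9 + 7 + 164)

round(100 * c(test_error = test_error, oob_error = oob_error), 2)
# test_error  oob_error
#       1.90       3.39
```

The two numbers come from different sets of cases, so they need not agree exactly, but the OOB estimate gives a reasonable preview of what to expect on held-out data.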

This technique performed quite well, misclassifying only 4 of the 211 test cases.  Now let's see how it does when we compare all three.

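> #prepare model for ROC curve; prediction() and performance() come from ROCR
> library(ROCR) #already loaded if you are continuing from the earlier parts
> test.forest = predict(forest_train, type = "prob", newdata = test)
> forestpred = prediction(test.forest[,2], test$class) #column 2 = probability of malignant
> forestperf = performance(forestpred, "tpr", "fpr")
> plot(perf, main="ROC", col=1) #ctree, built in the earlier parts
> plot(bagperf, col=2, add=TRUE) #bagging, built in the earlier parts
> plot(forestperf, col=3, add=TRUE)
> legend(0.6, 0.6, c('ctree', 'bagging', 'rforest'), fill=1:3)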

From the looks of it, random forest may have outperformed bagging, providing a better true positive rate.

The AUC values for random forest, bagging and the conditional inference tree are 0.9967, 0.9918 and 0.9854 respectively, which I think confirms the plot above.  Keep in mind that when looking at an ROC plot, the curve for a perfect classifier rises vertically from (0, 0) to (0, 1) and then runs horizontally to (1, 1), giving an AUC of 1.
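For anyone wondering what the AUC actually measures, here is a minimal base R sketch.  It uses the rank-sum identity: the AUC is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one.  The toy labels and scores and the function name auc_rank are hypothetical, made up for illustration; they are not the breast cancer results.

```r
# AUC via the rank-sum (Mann-Whitney) identity, base R only
auc_rank <- function(scores, labels) {
  r     <- rank(scores)            # average ranks, so ties are handled
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical toy data: 1 = positive class, scores = predicted probabilities
labels <- c(0, 0, 0, 1, 0, 1, 1, 1)
scores <- c(0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)
auc_rank(scores, labels)                              # 0.9375 (15 of 16 pairs ordered correctly)

# A classifier that separates the classes perfectly has an AUC of 1
auc_rank(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))        # 1

# With the ROCR objects used in this post, the usual route would be:
# performance(forestpred, "auc")@y.values[[1]]
```

The commented ROCR line at the end is how I would expect to pull the AUC from a prediction object like forestpred, assuming ROCR is loaded as in the earlier parts.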

When thinking of diagnostic tests, as is the case with this breast cancer data, one should understand the PPV or Positive Predictive Value.  This is calculated as the number of true positives divided by the total number of positive predictions (true positives plus false positives).

If we compare bagging and random forest on the malignant predictions for the test data, we get:

   bagging PPV = 67/73 = 91.8%
   rforest PPV = 67/70 = 95.7%
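The arithmetic is easy to reproduce.  In the sketch below the true and false positive counts are read off the figures above (73 - 67 = 6 false positives for bagging, 70 - 67 = 3 for random forest); the helper name ppv is my own.

```r
# PPV = true positives / (true positives + false positives)
ppv <- function(tp, fp) tp / (tp + fp)

bagging_ppv <- ppv(tp = 67, fp = 6)   # 67 / 73
rforest_ppv <- ppv(tp = 67, fp = 3)   # 67 / 70

round(100 * c(bagging = bagging_ppv, rforest = rforest_ppv), 1)
# bagging rforest
#    91.8    95.7

# Expected false alarms per 10,000 patients predicted malignant,
# assuming the same PPV holds at scale
false_alarms <- round(10000 * (1 - c(bagging = bagging_ppv, rforest = rforest_ppv)))
false_alarms
# bagging rforest
#     822     429
```

Keep in mind that PPV depends on how common malignancy is in the population being tested, so these scaled-up counts only hold if the mix of benign and malignant cases resembles this test set.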

Now, this may not seem like much of a difference, but consider the implications when scaled up to potentially tens of thousands of patients.  How would you like to be one of the 6 patients (73 - 67) that the bagging algorithm incorrectly predicted to be malignant?

OK, that wraps up this series.  I think I'll take a break from machine learning and get back to dabbling in time series or other econometric analysis.


