Sunday, October 27, 2013

Let's have a "party" and tear this place "rpart"!

For many problems, classification and regression trees can be a simple and elegant solution, assuming you know their well-documented strengths and weaknesses.  I first explored their use several years ago with JMP, which is easy to use.  Without JMP Pro, however, you cannot use the more advanced techniques (ensemble methods, if you will) such as bagging, boosting and random forests.  I don't have JMP Pro, and with great angst I realized I'm not Mr. Ensemble and need to get with the program.  Alas, if you can make it with R, you can make it anywhere.

Before I drive myself mad with bagging and boosting, I wanted to cover the basic methods.  A cursory search of the internet suggests that the R packages "party" and "rpart" are worth learning and evaluating.  I applied them to a data set on college football that I've been compiling from the web.  Keep in mind, the analysis below will reveal no insights on breaking Vegas.  It is just a few simple variables cobbled together to learn the packages.

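> cfb = read.csv(file.choose(), stringsAsFactors = TRUE)  #read text columns as factors so bowl is a factor (no longer the default in R 4.0 and later)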

> head(cfb)

Looking at the data, you can see wins = total wins for 2012, points = average points scored per game, allowed = average points the opponents scored per game, ygained = average yards gained per game, yallowed = average yards opponents gained per game, margin = average turnover margin per game and bowl = yes for 6 wins or more, else no (6 wins makes a team bowl eligible).
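As a quick illustration of how the bowl flag relates to wins, here is a minimal sketch.  The data frame toy and its values are hypothetical stand-ins invented for this example (they are not rows from the actual data set), and only a few of the columns are shown.

```r
# Hypothetical stand-in rows; the numbers are invented for illustration only
toy = data.frame(
  wins    = c(11, 6, 5, 2),
  points  = c(38.5, 27.1, 24.0, 17.3),
  allowed = c(18.2, 26.4, 28.9, 35.6)
)

# 6 or more wins makes a team bowl eligible
toy$bowl = factor(ifelse(toy$wins >= 6, "yes", "no"), levels = c("no", "yes"))

str(toy)          # bowl should show up as a factor with levels "no" and "yes"
table(toy$bowl)   # counts of eligible and non-eligible teams
```

Storing bowl as a factor matters because both ctree and rpart treat a factor response as a classification problem.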

#giving party a try first
> library(party)

> cfb_tree = ctree(bowl ~ points + allowed + ygained + yallowed + margin, data=cfb)  #build the classification model to predict bowl eligibility

> table(predict(cfb_tree), cfb$bowl)  #examine the confusion matrix

77 of the 124 Division I teams won 6 or more games.  The model predicts 76 teams to be bowl eligible and gets 71 of them right.  
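The counts quoted above are enough to rebuild the whole confusion matrix and work out some summary rates.  This is a sketch that assumes my reading of those numbers: 124 teams, 77 actually eligible, 76 predicted eligible, 71 of those predictions correct.  The object name cm is just for illustration.

```r
# Rebuild the 2x2 table from the quoted counts (filled column by column)
#   actual "no"  column: 124 - 77 = 47 teams, 76 - 71 = 5 of them predicted "yes"
#   actual "yes" column: 77 teams, 71 predicted "yes", so 6 predicted "no"
cm = matrix(c(42, 5, 6, 71), nrow = 2,
            dimnames = list(predicted = c("no", "yes"),
                            actual    = c("no", "yes")))
print(cm)

accuracy    = sum(diag(cm)) / sum(cm)                 # share of all teams classified correctly
sensitivity = cm["yes", "yes"] / sum(cm[, "yes"])     # share of eligible teams the model found
precision   = cm["yes", "yes"] / sum(cm["yes", ])     # share of "yes" predictions that were right
baseline    = max(colSums(cm)) / sum(cm)              # accuracy of always guessing the bigger class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        precision = precision, baseline = baseline), 3)
```

Comparing accuracy with the baseline is a handy sanity check, since always guessing "yes" would already be right for 77 of the 124 teams.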

> print(cfb_tree)

> plot(cfb_tree)

Print and plot allow you to examine the model.  Again, nothing earth-shattering here.  Various options are available to change the appearance of the tree.  With the party package there is no separate pruning step: ctree stops splitting on its own once no remaining split passes its significance test.

On to rpart!

> library(rpart)
> cfb_part = rpart(bowl ~ points + allowed + ygained + yallowed + margin, data=cfb)
> print(cfb_part)

> plot(cfb_part)
> text(cfb_part)

Notice that the split calculations are slightly different.  I'm not sure why, but plan to dig into this fact.  Also, rpart does not automatically optimize the number of splits; you have to prune the tree yourself.  Here is how to investigate the matter:

> print(cfb_part$cptable)

Look at row 3 (three splits) and its associated xerror, the cross-validated error.  At 0.42 it is the lowest in the table, which tells us that three splits is optimal.  Another way to do this is to pick out that row in code and prune at its CP value:
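When reading the cptable, it can help to pull the nsplit value directly rather than counting rows, since the first row is the tree with zero splits and the row number need not match the number of splits.  The sketch below uses the kyphosis data that ships with rpart as a stand-in, because the football file is not available here; the names fit, cpt, best_row and best_nsplit are only for illustration.

```r
library(rpart)

set.seed(42)  # xerror comes from random cross-validation folds, so fix the seed
fit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

cpt = fit$cptable
print(cpt)    # columns: CP, nsplit, rel error, xerror, xstd

best_row    = which.min(cpt[, "xerror"])   # row with the lowest cross-validated error
best_nsplit = cpt[best_row, "nsplit"]      # how many splits that row's tree has
best_cp     = cpt[best_row, "CP"]          # complexity parameter to hand to prune()

cat("lowest xerror is in row", best_row, "with", best_nsplit, "splits\n")
```

Because the cross-validation folds are random, the xerror column (and possibly the winning row) can change from run to run unless the seed is fixed.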

> opt = which.min(cfb_part$cptable[,"xerror"])
> cp = cfb_part$cptable[opt, "CP"]
> cfb_prune = prune(cfb_part, cp = cp)
> print(cfb_prune)

> plot(cfb_prune, margin=.05)
> text(cfb_prune)

The rather prosaic trees can be jazzed up, especially by using the prp() function in the rpart.plot package.
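As a rough sketch of what that looks like, the snippet below fits a tree to the kyphosis data that ships with rpart (a stand-in for the football data) and draws it with rpart.plot when that package is installed, falling back to the base rpart plot otherwise.  The particular type and extra settings are just one choice I picked for illustration, not a recommendation.

```r
library(rpart)

# Stand-in model on a data set bundled with rpart
fit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

if (requireNamespace("rpart.plot", quietly = TRUE)) {
  # type = 2 draws the split labels below the node labels;
  # extra = 1 adds the number of observations in each class to every node
  rpart.plot::prp(fit, type = 2, extra = 1)
} else {
  # Base rpart plotting, with per-node counts added by use.n
  plot(fit, margin = 0.05)
  text(fit, use.n = TRUE)
}
```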

Which package do I prefer?  Well, like most things, this little experiment has raised more questions than answers.  

Indeed, no rest for the wicked.