
Monday, December 30, 2013

Spurious Regression of Time Series

spu.ri.ous
adjective

: not genuine, sincere, or authentic
: based on false ideas or bad reasoning

http://www.merriam-webster.com/dictionary/spurious

When it comes to analysis of time series, just because you can doesn't mean you should, particularly with regard to regression.  In short, if you have highly autoregressive time series and you build an OLS model, you will find estimates and t-statistics indicating a relationship where none exists.  Without getting into the theory of the problem, let's simply go over an example using R.  If you want a proper way to examine the relationship between one or more x's and y, I recommend the vars package in R.  If you want to dabble in causality, then explore Granger causality, which I touch on in my very first post (even the ultimate econometrics cynic, Nassim Taleb, recommends the technique in his book, Antifragile: Things That Gain from Disorder).
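Just as a taste, here is a minimal sketch of a Granger causality test using grangertest() from the lmtest package (my own illustrative example, not part of the original analysis; the two series are simply differenced random walks, so we would expect no causality):

> library(lmtest)
> x = diff(cumsum(rnorm(200))) # differencing a random walk leaves white noise
> y = diff(cumsum(rnorm(200)))
> grangertest(y ~ x, order = 2) # F-test on the lags of x; typically a large p-value here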

# produce two random walks

> rwalk1 = c(cumsum(rnorm(200)))
> rwalk1.ts = ts(rwalk1)
> rwalk2 = c(cumsum(rnorm(200)))
> rwalk2.ts = ts(rwalk2)
# use seqplot.ts from the tseries package to plot both series
> seqplot.ts(rwalk1.ts, rwalk2.ts)
 
These two series are pure random walks (highly autoregressive), so there should be no relationship whatsoever between them.

#build a linear model and examine it

> spurious = lm(rwalk1.ts ~ rwalk2.ts)
> summary(spurious)
 
Call:
lm(formula = rwalk1.ts ~ rwalk2.ts)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.341  -4.495  -1.508   2.602  13.488 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.79154    1.15145   3.293  0.00117 ** 
rwalk2.ts   -0.31904    0.06928  -4.605 7.36e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.87 on 198 degrees of freedom
Multiple R-squared:  0.09675, Adjusted R-squared:  0.09219 
F-statistic: 21.21 on 1 and 198 DF,  p-value: 7.363e-06
 

We have a highly significant p-value.  This will happen more often than not when you regress one random walk on another.  The implication is that you need stationary data (e.g., by taking first differences) or a dynamic regression.
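As a quick sanity check (my own sketch, not part of the original output), re-running the regression on first differences should make the "relationship" vanish:

> # regress the first differences of the two walks; the slope should now be insignificant
> diff.fit = lm(diff(rwalk1.ts) ~ diff(rwalk2.ts))
> summary(diff.fit)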


> acf(resid(spurious)) #autocorrelation plot of residuals showing we violate the OLS assumption of no serial correlation
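If you prefer a formal test to eyeballing the ACF, the Durbin-Watson test from the lmtest package is one option (a quick sketch, assuming lmtest is installed):

> library(lmtest)
> dwtest(spurious) # a DW statistic far below 2 signals positive serial correlation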
 
To do a dynamic regression, we need to regress Y on X and lagged Y.


> lag.walk1 = lag(rwalk1.ts, -1) # creates the lagged Y series
> y = cbind(rwalk1.ts, rwalk2.ts, lag.walk1) # ties the three series together; this is important in R!
> head(y) #verify that it worked
 
       rwalk1.ts  rwalk2.ts   lag.walk1
[1,]  0.23560254 -1.7652148          NA
[2,] -1.55501435 -0.9889759  0.23560254
[3,]  0.05655502  0.4675756 -1.55501435
[4,]  0.55352522  2.0692308  0.05655502
[5,]  1.61779922  1.9401852  0.55352522
[6,]  1.11942660  1.4950455  1.61779922
  
> model = lm(y[,1] ~ y[,2] + y[,3]) #the dynamic regression model
> summary(model)
 
Call:
lm(formula = y[, 1] ~ y[, 2] + y[, 3])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.63010 -0.66273  0.04715  0.63053  2.45922 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.036506   0.208782   0.175    0.861    
y[, 2]      -0.007199   0.012823  -0.561    0.575    
y[, 3]       0.994871   0.012368  80.436   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.01 on 196 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.9735, Adjusted R-squared:  0.9732 
F-statistic:  3601 on 2 and 196 DF,  p-value: < 2.2e-16 
 
# As you would expect, the second time series is no longer significant,  but the lagged values are.
   
# examine the residuals from the two models and notice the pattern in the spurious model
> plot(resid(model))
> plot(resid(spurious))
 
 
You can also use the plot(model) command to examine the residuals in detail.

There you have it. Time series correlation and regression are famous last words, the redneck equivalent of "here, hold my beer and watch this."

Tuesday, December 10, 2013

Part 3: Random Forests and Model Selection Considerations

I want to wrap up this series on the breast cancer data set and move on to other topics.  Here I will add the random forest technique and evaluate all three models together: the conditional inference tree, bootstrap aggregation, and now random forest.  I was going to include several other techniques, but have decided to move on to other data sets and to explore several of the machine learning R packages in greater depth.  In my investigation of R machine learning, I feel there is a paucity of easy-to-follow tutorials.  I shall endeavor to close that gap.

 I'll spend some time here going over the ROC curve and how to use it for model selection.

Random forests use the bootstrap with replacement, as we did last time, but additionally sample a random subset of the predictors at each tree node (here 3 of the 9 variables).  Roughly a third of the observations are left out of each bootstrap sample, which is what provides the "out-of-bag" error estimate.  The following website from Cal Berkeley provides some excellent background information on the topic.

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#overview

We can use the party package from earlier, but let's try randomForest.

> # randomforest
> library(randomForest)
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> forest_train = randomForest(class ~ ., data = train)
> print(forest_train) #notice the number of trees, number of splits and the confusion matrix

Call:
 randomForest(formula = class ~ ., data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 3.39%
Confusion matrix:
                     benign malignant    class.error
benign             292          9          0.02990033
malignant            7       164          0.04093567
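The summary does not tell us which variables the forest leans on; the package's importance() and varImpPlot() functions will (a quick aside, not part of the original analysis):

> importance(forest_train) # mean decrease in Gini for each predictor
> varImpPlot(forest_train) # dot chart of variable importance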

> testforest = predict(forest_train, newdata=test)
> table(testforest, test$class) #confusion matrix for test set
         
testforest     benign   malignant
  benign           140         1
  malignant           3       67

This technique performed quite well; now let's see how it stacks up when we compare all three.

> #prepare model for ROC Curve
> test.forest = predict(forest_train, type = "prob", newdata = test)
> forestpred = prediction(test.forest[,2], test$class)
> forestperf = performance(forestpred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)
> plot(bagperf, col=2, add=TRUE)
> plot(perf, col=1, add=TRUE)
> plot(forestperf, col=3, add=TRUE)
> legend(0.6, 0.6, c('ctree', 'bagging', 'rforest'), 1:3)

From the looks of it, random forest may have outperformed bagging, providing a better true positive rate.

The AUC values for random forest, bagging and conditional inference are .9967, .9918 and .9854 respectively, which I think confirms the plot above.  Keep in mind that on an ROC plot the perfect classifier hugs the upper-left corner: a vertical line from (0,0) straight up the y-axis, then across the top.
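For reference, the forest's AUC comes from ROCR the same way as in the prior posts, using the forestpred object created above (a quick sketch):

> forest.auc = performance(forestpred, "auc")
> unlist(forest.auc@y.values) # matches the ~.9967 quoted above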

When thinking of diagnostic tests, as is the case with this breast cancer data, one should understand the PPV, or Positive Predictive Value.  This is calculated as the number of true positives divided by the total number of predicted positives (true positives plus false positives).

If we compare bagging versus random forest on the test data for malignancy, we get:

   bagging PPV = 67/73 = 91.8%
   rforest PPV = 67/70 = 95.7%
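The same arithmetic in R, reading the counts straight off the two test-set confusion matrices above:

> # PPV = true positives / (true positives + false positives)
> 67 / (67 + 6) # bagging:       0.918
> 67 / (67 + 3) # random forest: 0.957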

Now, this may not seem like much of a difference, but consider the implications when scaled up to potentially tens of thousands of patients.  How would you like to be one of the 6 benign cases that the bagging algorithm incorrectly flagged as malignant?

OK, that wraps up this series.  I think I'll take a break from machine learning and get back to dabbling in time series or other econometric analysis.

Cheers,

Cory


Sunday, December 1, 2013

Ensemble, Part2 (Bootstrap Aggregation)

Part 1 consisted of building a classification tree with the "party" package.  I will now use "ipred" to examine the same data with a bagging (bootstrap aggregation) algorithm.

> library(ipred)
> train_bag = bagging(class ~ ., data=train, coob=T)
> train_bag

Bagging classification trees with 25 bootstrap replications

Call: bagging.data.frame(formula = class ~ ., data = train, coob = T)

Out-of-bag estimate of misclassification error:  0.0424
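Note that bagging() defaulted to 25 bootstrap replications; more trees can be requested with the nbagg argument (a quick sketch, your error estimate will differ slightly):

> # 100 bootstrap replications; coob=T keeps the out-of-bag error estimate
> train_bag100 = bagging(class ~ ., data=train, coob=T, nbagg=100)
> train_bag100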

> table(predict(train_bag), train$class)
         
                       benign malignant
  benign               290           9
  malignant             11       162

> testbag = predict(train_bag, newdata=test)
> table(testbag, test$class)
         
testbag       benign    malignant
  benign          137          1
  malignant          6        67

If you compare the confusion matrices from this week to the prior post, what do you think?

Let's recall the prior ROC curve and combine it with the bagged model.

#prepare bagged model for curve
> test.bagprob = predict(train_bag, type = "prob", newdata = test)
> bagpred = prediction(test.bagprob[,2], test$class)
> bagperf = performance(bagpred, "tpr", "fpr")

> plot(perf, main="ROC", colorize=T)
> plot(bagperf, col=2, add=TRUE)
> plot(perf, col=1, add=TRUE)
> legend(0.6, 0.6, c('ctree', 'bagging'), 1:2)

As we can see from glancing at the confusion matrices, the bagged model outperforms the standard tree model.  Finally, let's have a look at the AUC (.992 with bagging versus .985 last time around).

> auc.curve = performance(bagpred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9918244


Slot "alpha.values":
list()
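The S4 print method is verbose; if you just want the number, it lives in the y.values slot (a quick sketch):

> unlist(auc.curve@y.values) # 0.9918244, the bagged model's AUC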

OK, more iterations to come: boosting, random forest, and, since no self-respecting data scientist would leave it out, logistic regression.

Cheers.

Sunday, November 17, 2013

Ensemble Methods, part 1

Last week I dabbled in building classification trees with the party and rpart packages.  Now, I want to put together a series where I can apply those basic trees along with advanced techniques like bagging, boosting and random forest.  Additionally, I've taken the leap and have started using the RStudio interface.  I must say I'm quite pleased with the features, and once you get comfortable it can be quite a time-saver.  Have a look at http://www.rstudio.com/.

Having worked in oncology (sales rep and market research) in a previous lifetime, I wanted to apply the techniques to the Wisconsin Breast Cancer data set.  It is available at the UC Irvine Machine Learning Repository.  Here is the direct link... http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Prognostic%29

All I intend to do in this entry is load the data, create test and training sets, build a classification tree model using the "party" package and, finally, write the code to produce ROC and lift charts.  In further iterations, I will create additional tree models on this data set and examine which method is best at predicting whether a tumor is benign or malignant.

This code drops the annoying ID variable, which is of no use, and then eliminates the few missing observations, which are of no consequence.
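The loading step itself is not shown; a minimal sketch of how the raw UCI file could be read is below (the file name and column names are my assumptions, chosen to match the str() output that follows; the bare-nuclei column uses "?" for missing values):

> # assumption: raw comma-separated file downloaded from the UCI repository to the working directory
> bc.names = c("ID", "thickness", "Usize", "Ushape", "adhesion", "SECS", "nuclei", "chromatin", "nucleoli", "mitoses", "class")
> breastcancer = read.csv("breast-cancer-wisconsin.data", header=FALSE, na.strings="?", col.names=bc.names)
> breastcancer$class = factor(breastcancer$class, levels=c(2,4), labels=c("benign","malignant"))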

> #dropping ID variable
> breastcancer$ID = NULL
> str(breastcancer)
'data.frame': 699 obs. of  10 variables:
 $ thickness: int  8 6 1 1 1 5 3 3 3 8 ...
 $ Usize    : int  4 6 1 1 1 1 1 1 1 8 ...
 $ Ushape   : int  5 6 1 3 2 1 4 1 3 8 ...
 $ adhesion : int  1 9 1 1 1 1 1 1 1 1 ...
 $ SECS     : int  2 6 1 2 3 2 2 2 2 2 ...
 $ nuclei   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ chromatin: int  7 7 2 2 1 3 3 3 2 6 ...
 $ nucleoli : int  3 8 1 1 1 1 1 1 1 10 ...
 $ mitoses  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 2 1 1 1 1 1 1 1 1 2 ...

> bcancer = na.omit(breastcancer)  # delete missing observations
> str(bcancer)
'data.frame': 683 obs. of  10 variables:
 $ thickness: int  5 8 1 10 7 5 6 5 5 9 ...
 $ Usize    : int  4 10 1 7 3 4 10 2 3 4 ...
 $ Ushape   : int  4 10 1 7 2 4 10 3 4 5 ...
 $ adhesion : int  5 8 1 6 10 9 2 1 1 10 ...
 $ SECS     : int  7 7 2 4 5 2 8 6 8 6 ...
 $ nuclei   : int  10 10 10 10 10 10 10 10 10 10 ...
 $ chromatin: int  3 9 3 4 5 5 7 5 4 4 ...
 $ nucleoli : int  2 7 1 1 4 6 3 1 9 8 ...
 $ mitoses  : int  1 1 1 2 4 1 3 1 1 1 ...
 $ class    : Factor w/ 2 levels "benign","malignant": 1 2 1 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:16] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...

Now we create the train and test sets...

> set.seed(1234)
> ind = sample(2,nrow(bcancer), replace=TRUE, prob=c(0.7, 0.3))
> train = bcancer[ind==1,]
> test = bcancer[ind==2,]

Then, we roll up our sleeves, pour a glass of Bourbon and get to work...

> #conditional inference tree with the "party" package
> train_tree = ctree(class ~ ., data=train)
> table(predict(train_tree), train$class)
         
                     benign    malignant
  benign           289           6
  malignant         12       165

So, the model says that 177 patients had malignant tumors whereas in reality it was only 171.

Here is a plot of the tree:

> plot(train_tree, type="simple")

OK, now it is time to see how we did on the test data...

> testpred = predict(train_tree, newdata=test)
> table(testpred, test$class)
         
testpred       benign      malignant
  benign          138             3
  malignant          5           65

Quite consistent with what we saw on the training data.

Finally, I will start the code that will let me compare multiple methods.  We first have to gin up the predicted probabilities for the test set from our model with this annoying bit of code (this will be easier with the other methods).

> test$probabilities = 1 - unlist(treeresponse(train_tree, newdata=test), use.names=F)[seq(1,nrow(test)*2,2)]

> library(ROCR)  #loading the package to help us calculate and visualize the classifiers

> pred = prediction(test$probabilities, test$class) #create an object of the predicted probabilities

#create and plot the ROC chart with true positive rate and false positive rate
> perf = performance(pred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)

#create and plot the lift chart with lift and rate of positive predictions

> perf_lift = performance(pred, "lift", "rpp")
> plot(perf_lift, main="lift chart", colorize=T)

Oh, and for good measure I decided to include the area under the curve...

> auc.curve = performance(pred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9854484

Next time I will introduce random forest and bagging and/or boosting, based on my further research into the subject.

Cheers,

Meister

Sunday, October 27, 2013

Let's have a "party" and tear this place "rpart"!

For many problems, classification and regression trees can be a simple and elegant solution, assuming you know their well-documented strengths and weaknesses.  I first explored their use several years ago with JMP, which is easy to use.  If you do not have JMP Pro, you will not be able to use the more advanced techniques (ensemble methods if you will) like bagging, boosting, random forest etc.  I don't have JMP Pro and with great angst I realized I'm not Mr. Ensemble and need to get with the program.  Alas, if you can make it with R, you can make it anywhere.

Before I drive myself mad with bagging and boosting, I wanted to cover the basic methods.  It seems through a cursory search of the internet that the R packages "party" and "rpart" are worth learning and evaluating.  I applied them to a data set on college football I've been compiling from the website cfbstats.com.  Keep in mind, the analysis below will reveal no insights on breaking Vegas.  It is just a few simple variables cobbled together to learn the packages.

> cfb = read.csv(file.choose())

> head(cfb)

Looking at the data, the variables are:

      wins = total wins for 2012
      points = average points scored per game
      allowed = average points the opponents scored per game
      ygained = average yards gained per game
      yallowed = average yards the opponents gained per game
      margin = average turnover margin per game
      bowl = yes for 6 wins or more, else no (6 wins makes a team bowl eligible)
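For what it's worth, the bowl flag can be derived straight from the wins column; a one-liner like this (my reconstruction, not the original prep code) would do it:

> cfb$bowl = factor(ifelse(cfb$wins >= 6, "yes", "no")) # bowl eligible with 6+ wins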

#giving party a try first
> library(party)

> cfb_tree = ctree(bowl ~ points + allowed + ygained + yallowed + margin, data=cfb)  #build the classification model to predict bowl eligibility

> table(predict(cfb_tree), cfb$bowl)  #examine the confusion matrix

77 of the 124 Division I teams won 6 or more games.  The model predicts 76 teams as bowl eligible, correctly classifying 71 of them.

> print(cfb_tree)

> plot(cfb_tree)

Print and plot allow you to examine the model.  Again, nothing earth-shattering here.  Various options are available to change the appearance of the tree.  The optimal number of splits is automatically determined with the party package.

On to rpart!

> library(rpart)
> cfb_part = rpart(bowl ~ points + allowed + ygained + yallowed + margin, data=cfb)
> print(cfb_part)

> plot(cfb_part)
> text(cfb_part)

Notice that the split calculations are slightly different, likely because party (ctree) selects splits with statistical significance tests while rpart follows the classic CART impurity approach; I plan to dig into this further.  Also, rpart does not automatically optimize the number of splits.  Here is how to investigate the matter:

> print(cfb_part$cptable)

Look at row 3 (three splits) and its associated xerror.  You can see that 0.42 is the lowest and tells us that 3 tree splits is optimal.  Another way to do this...

> opt = which.min(cfb_part$cptable[,"xerror"])
> cp = cfb_part$cptable[opt, "CP"]
> cfb_prune = prune(cfb_part, cp = cp)
> print(cfb_prune)

> plot(cfb_prune, margin=.05)
> text(cfb_prune)
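rpart also provides plotcp(), which charts the cross-validated error against the complexity parameter and is a handy visual check on the pruning choice (a quick sketch):

> plotcp(cfb_part) # xerror versus cp; look for the smallest tree near the minimum error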

The rather prosaic trees can be jazzed up, especially with the rpart.plot package and its prp() function.
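For example, a quick sketch with rpart.plot (assuming the package is installed):

> library(rpart.plot)
> prp(cfb_prune, type=2, extra=104, fallen.leaves=TRUE) # see ?prp for the many display options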

Which package do I prefer?  Well, like most things this little experiment has raised more questions than answers.  

Indeed, no rest for the wicked.


Wednesday, September 25, 2013

NFL week 3 update

With another NFL week down, we are starting to see separation between the contenders and the “better luck next year” teams.  Any paid TV mouthpiece worth their salt will tell you it is a quarterback-driven league.  Driven indeed.  In the last post, I dabbled in the simple code of a correlation heat map.  Now, I realize I may have led the flock astray in my haste to create some wow graphics using the data from advancednflstats.com.  But that is the great thing about R: the ability to cook up some code and add salt to taste.  So, let’s kick it up a notch with a look at the week 3 NFL QB data, creating different versions of correlation plots.  Choose your preference, a la carte!

> library(corrplot) #load the versatile package corrplot
> attach(nfl)
> head(nfl)
Note: top 5 quarterbacks thru week 3:
      1.      Peyton Manning
      2.      Jay Cutler
      3.      Ryan Tannehill
      4.      Drew Brees
      5.      Philip Rivers

> qb = cor(nfl[ ,4:15]) # correlation subset of continuous variables

The corrplot package has seven different visualization methods: "circle", "square", "ellipse", "number", "shade", "color" and "pie".


> corrplot(qb, method = "circle")


> corrplot(qb, method = "ellipse")

Two nice examples, but this is my favorite below.  I really like the ability to see both the visual portrayal and the statistics on one chart.

> corrplot.mixed(qb)

If you are interested in further pursuing corrplot, I recommend this website.
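One more option worth knowing: the order argument can rearrange the variables so that correlated stats cluster together, e.g. order = "hclust" (a quick sketch):

> corrplot(qb, method="square", order="hclust") # hierarchical clustering puts related stats side by side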