Monday, December 30, 2013

Spurious Regression of Time Series


: not genuine, sincere, or authentic
: based on false ideas or bad reasoning

When it comes to time series analysis, just because you can doesn't mean you should, particularly with regard to regression.  In short, if you have two highly autoregressive time series and you build an OLS model, you will find coefficient estimates and t-statistics indicating a relationship where none exists.  Without getting into the theory of the problem, let's simply go over an example using R.  If you want to look at the proper way of examining the relationship between one or several x's and y, I recommend the vars package in R.  If you want to dabble in causality, then explore Granger causality, which I touched on in my very first post (even the ultimate econometrics cynic, Nassim Taleb, recommends the technique in his book, Antifragile: Things That Gain From Disorder).
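For a taste of the Granger idea, here is a hypothetical, base-R sketch (not from the original post, and the variable names are my own): lagged x "Granger-causes" y if adding x's past to a regression of y on its own past significantly improves the fit. The grangertest() function in the lmtest package wraps exactly this F-test; by hand it looks like this on simulated data:

```r
# Granger-style test by hand: does lagged x help predict y beyond y's own lag?
set.seed(42)
n <- 200
x <- rnorm(n)
y <- c(0, head(x, -1)) + rnorm(n, sd = 0.1)  # y is driven by lagged x

y.now <- y[2:n]
y.lag <- y[1:(n - 1)]
x.lag <- x[1:(n - 1)]

restricted   <- lm(y.now ~ y.lag)          # y's own past only
unrestricted <- lm(y.now ~ y.lag + x.lag)  # plus x's past
anova(restricted, unrestricted)            # significant F supports Granger causality
```

Here the F-test is overwhelmingly significant because y really is built from lagged x.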

# produce two random walks

> library(tseries) #needed for seqplot.ts below
> rwalk1 = c(cumsum(rnorm(200)))
> rwalk1.ts = ts(rwalk1)
> rwalk2 = c(cumsum(rnorm(200)))
> rwalk2.ts = ts(rwalk2)
#use seqplot.ts from the tseries package
> seqplot.ts(rwalk1.ts, rwalk2.ts)

These two series are pure random walks (highly autoregressive), and there should be no relationship whatsoever between them.

#build a linear model and examine it

> spurious = lm(rwalk1.ts ~ rwalk2.ts)
> summary(spurious)
Call:
lm(formula = rwalk1.ts ~ rwalk2.ts)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.341  -4.495  -1.508   2.602  13.488 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.79154    1.15145   3.293  0.00117 ** 
rwalk2.ts   -0.31904    0.06928  -4.605 7.36e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.87 on 198 degrees of freedom
Multiple R-squared:  0.09675, Adjusted R-squared:  0.09219 
F-statistic: 21.21 on 1 and 198 DF,  p-value: 7.363e-06

We have a highly significant p-value even though the two series are unrelated.  This will happen whenever you regress one random walk on another.  The implication is that you need stationary data, e.g. via first differencing, or you need to do a dynamic regression.

> acf(resid(spurious)) #autocorrelation plot of residuals showing we violate the OLS assumption of no serial correlation
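Before the dynamic regression, it is worth seeing the first-differencing fix in action. A quick sketch (with a seed added for reproducibility, so these are not the same walks as above): differencing each walk leaves stationary noise, and the spurious relationship goes away.

```r
# sketch: first differencing turns each random walk into stationary noise,
# removing the spurious relationship (seed added for reproducibility)
set.seed(123)
walk1 <- ts(cumsum(rnorm(200)))
walk2 <- ts(cumsum(rnorm(200)))
fit <- lm(diff(walk1) ~ diff(walk2))
summary(fit)  # the slope is no longer systematically significant
```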

To do a dynamic regression, we need to regress Y on X and lagged Y.

> lag.walk1 = lag(rwalk1.ts, -1) #creates lagged Ys
> y = cbind(rwalk1.ts, rwalk2.ts, lag.walk1) #ties the three series together; this is important in R!
> head(y) #verify that it worked
       rwalk1.ts  rwalk2.ts   lag.walk1
[1,]  0.23560254 -1.7652148          NA
[2,] -1.55501435 -0.9889759  0.23560254
[3,]  0.05655502  0.4675756 -1.55501435
[4,]  0.55352522  2.0692308  0.05655502
[5,]  1.61779922  1.9401852  0.55352522
[6,]  1.11942660  1.4950455  1.61779922
> model = lm(y[,1] ~ y[,2] + y[,3]) #the dynamic regression model
> summary(model)
Call:
lm(formula = y[, 1] ~ y[, 2] + y[, 3])

Residuals:
     Min       1Q   Median       3Q      Max 
-2.63010 -0.66273  0.04715  0.63053  2.45922 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.036506   0.208782   0.175    0.861    
y[, 2]      -0.007199   0.012823  -0.561    0.575    
y[, 3]       0.994871   0.012368  80.436   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.01 on 196 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.9735, Adjusted R-squared:  0.9732 
F-statistic:  3601 on 2 and 196 DF,  p-value: < 2.2e-16 
# As you would expect, the second time series is no longer significant, but the lagged y values are.
# examine the residuals from the two models and notice the pattern in the spurious model
> plot(resid(model))
> plot(resid(spurious))

You can also use the plot(model) command to examine the residuals in detail.

There you have it.  Naive time series correlation and regression are famous last words, the redneck equivalent of "Here, hold my beer and watch this."

Tuesday, December 10, 2013

Part 3: Random Forests and Model Selection Considerations

I want to wrap up this series on the breast cancer data set and move on to other topics.  Here I will include the random forest technique and evaluate all three modeling techniques together: the conditional inference tree, bootstrap aggregation, and random forest.  I was going to include several other techniques here, but have decided to move on to other data sets and to explore several of the machine learning R packages to a greater extent.  In my investigation of R machine learning, I feel there is a paucity of easy-to-follow tutorials.  I shall endeavor to close that gap.

 I'll spend some time here going over the ROC curve and how to use it for model selection.

Random forests utilize the bootstrap with replacement like we performed last time, but additionally sample a random subset of the predictor variables at each tree node (for classification, the default is roughly the square root of the number of predictors); about a third of the observations are left out of each bootstrap sample and used for the out-of-bag error estimate.  Breiman's random forests page at Cal Berkeley provides some excellent background information on the topic.

We can use the party package from earlier, but let's try randomForest.

> # randomforest
> library(randomForest)
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> forest_train = randomForest(class ~ ., data = train)
> print(forest_train) #notice the number of trees, number of splits and the confusion matrix

Call:
 randomForest(formula = class ~ ., data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 3.39%
Confusion matrix:
          benign malignant class.error
benign       292         9  0.02990033
malignant      7       164  0.04093567

> testforest = predict(forest_train, newdata=test)
> table(testforest, test$class) #confusion matrix for test set
testforest  benign malignant
  benign       140         1
  malignant      3        67

This technique performed quite well, now let's see how it does when we compare all three.

> #prepare model for ROC curve (prediction and performance are from the ROCR package)
> library(ROCR)
> test.forest = predict(forest_train, type = "prob", newdata = test)
> forestpred = prediction(test.forest[,2], test$class)
> forestperf = performance(forestpred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)
> plot(bagperf, col=2, add=TRUE)
> plot(perf, col=1, add=TRUE)
> plot(forestperf, col=3, add=TRUE)
> legend(0.6, 0.6, c('ctree', 'bagging', 'rforest'), 1:3)

From the looks of it, random forest may have outperformed bagging, providing a better true positive rate.

The AUC values for random forest, bagging, and conditional inference are .9967, .9918, and .9854 respectively, which I think confirms the plot above.  Keep in mind that when looking at an ROC plot, a perfect classifier would hug the y-axis: a vertical line at 0.0 on the x-axis, straight up to a true positive rate of 1.0.

When thinking of diagnostic tests, as is the case with this breast cancer data, one should understand the PPV, or Positive Predictive Value.  This is calculated as the number of true positives divided by the total number of positive predictions.

If we compare bagging versus random forest on the test data on malignancy we get:

   bagging PPV = 67/73 = 91.8%
   rforest PPV = 67/70 = 95.7%
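Those PPV numbers come straight from the test-set confusion matrices above; in R it is one line of arithmetic:

```r
# PPV = true positives / all predicted positives, from the test-set
# confusion matrices above
bag.ppv <- 67 / (67 + 6)  # bagging called 73 tumors malignant, 67 correctly
rf.ppv  <- 67 / (67 + 3)  # random forest called 70 malignant, 67 correctly
round(100 * c(bagging = bag.ppv, rforest = rf.ppv), 1)  # 91.8 and 95.7
```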

Now, this may not seem like much of a difference, but consider the implications when scaled up to potentially tens of thousands of patients.  How would you like to be one of the 6 patients whose benign tumor the bagging algorithm incorrectly flagged as malignant?

OK, that wraps up this series.  I think I'll take a break from machine learning and get back to dabbling in time series or other econometric analysis.



Sunday, December 1, 2013

Ensemble, Part2 (Bootstrap Aggregation)

Part 1 consisted of building a classification tree with the "party" package.  I will now use "ipred" to examine the same data with a bagging (bootstrap aggregation) algorithm.

> library(ipred)
> train_bag = bagging(class ~ ., data=train, coob=T)
> train_bag

Bagging classification trees with 25 bootstrap replications

Call: bagging(formula = class ~ ., data = train, coob = T)

Out-of-bag estimate of misclassification error:  0.0424

> table(predict(train_bag), train$class)
            benign malignant
  benign       290         9
  malignant     11       162

> testbag = predict(train_bag, newdata=test)
> table(testbag, test$class)
testbag     benign malignant
  benign       137         1
  malignant      6        67

If you compare the confusion matrices from this week to the prior post, what do you think?

Let's recall the prior ROC curve and combine it with the bagged model.

#prepare the bagged model for the ROC curve (again using ROCR's prediction and performance functions)
> test.bagprob = predict(train_bag, type = "prob", newdata = test)
> bagpred = prediction(test.bagprob[,2], test$class)
> bagperf = performance(bagpred, "tpr", "fpr")

> plot(perf, main="ROC", colorize=T)
> plot(bagperf, col=2, add=TRUE)
> plot(perf, col=1, add=TRUE)
> legend(0.6, 0.6, c('ctree', 'bagging'), 1:2)

As we could see from glancing at the confusion matrices, the bagged model outperforms the standard tree model.  Finally, let's have a look at the AUC (.992 with bagging versus .985 last time around).

> auc.curve = performance(bagpred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9918244

Slot "alpha.values":
list()
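If you just want the scalar, the slot syntax performance(bagpred, "auc")@y.values[[1]] pulls it out directly.  And for intuition about what that number means: AUC equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (the Mann-Whitney statistic).  Here is a base-R sketch on toy scores (not the breast cancer data):

```r
# AUC as the Mann-Whitney statistic: the chance a random positive outscores
# a random negative, with ties counting half. Toy scores, not the cancer data.
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0, 0, 1, 1)
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 0.75 for this toy example
```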

OK, more iterations to come: boosting, random forest, and, since no self-respecting data scientist would leave it out, logistic regression.