
Saturday, June 28, 2014

R Blog Readers Survey

I've almost achieved my goal of a minimum of 100 quality responses.  I will keep the survey open for a couple more days, so if you haven't responded, please feel free to have a crack at it.

Regards,

Cory

https://www.surveymonkey.com/s/7DY8DYH

Sunday, June 22, 2014

Survey - Voice of the Reader

Dear Blog Readers,

I have numerous topics I could discuss in future blog posts.  However, I would like to capture those topics that most interest you.  Therefore, I've put together a short survey (see the link below) for everyone to weigh in on the subject.

I will keep this open for a couple of weeks and once closed, I will publish the results and make the data public.

Thank you in advance,

Cory

https://www.surveymonkey.com/s/7DY8DYH

Wednesday, May 14, 2014

The NFL Reloads: Using R to Have a Look

NFL Draft 2014

It can't be easy being a Jacksonville Jaguars fan. It seems that management has learned nothing from the Blaine Gabbert debacle. By that I mean I'm not impressed with their first round pick Blake Bortles. Bucky Brooks of NFL.com called him a “developmental prospect”. Developmental? Is that what you want from the third overall pick? I hate to say it but I have to agree with Skip Bayless that Bortles is the new Gabbert in J-Ville. We shall see soon enough if I will eat these words!
At any rate, I've downloaded draft data from the web and you can explore the draft at your leisure. I scraped it off of Wikipedia at this link: http://en.wikipedia.org/wiki/2014_NFL_Draft
I put together some code to get you on your way to exploring.
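If you would rather pull the table straight into R than copy it by hand, here is a rough sketch using the XML package. The table index and the column renaming are my assumptions, so inspect the result (or simply save the table to a .csv) before running the code that follows.

library(XML)  #for readHTMLTable

url = "http://en.wikipedia.org/wiki/2014_NFL_Draft"
tables = readHTMLTable(url)

#assume the draft selections are the largest table on the page
sizes = sapply(tables, function(x) if (is.null(x)) 0 else nrow(x))
nfldraft = tables[[which.max(sizes)]]

#rename to the lower-case names used in the code below (assumes 8 columns)
names(nfldraft) = c("round", "pick", "team", "player", "position",
                    "college", "conference", "notes")

#a .csv saved from the page works just as well
#nfldraft = read.csv("nfl_draft_2014.csv", stringsAsFactors = TRUE)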

attach(nfldraft)
str(nfldraft)
## 'data.frame':    256 obs. of  8 variables:
##  $ round     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pick      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ team      : Factor w/ 32 levels "Arizona Cardinals",..: 13 29 15 4 23 2 30 8 18 11 ...
##  $ player    : Factor w/ 256 levels "Aaron Colvin ",..: 96 89 25 204 147 97 178 131 12 84 ...
##  $ position  : Factor w/ 18 levels "C","CB","DE",..: 3 11 13 18 9 11 18 2 9 17 ...
##  $ college   : Factor w/ 113 levels "Alabama","Arizona",..: 86 6 99 15 12 95 95 71 100 62 ...
##  $ conference: Factor w/ 24 levels "ACC","Big 12",..: 21 21 24 1 11 21 21 2 17 1 ...
##  $ notes     : Factor w/ 88 levels "","from Arizona[R1 - 5]",..: 1 86 1 20 1 1 1 51 9 1 ...
names(nfldraft)
## [1] "round"      "pick"       "team"       "player"     "position"  
## [6] "college"    "conference" "notes"
mytable = table(position, round)  #create a table of player position and round selected
mytable
##         round
## position 1 2 3 4 5 6 7
##      C   0 1 2 2 2 2 1
##      CB  5 1 2 9 4 7 5
##      DE  2 3 3 2 5 3 4
##      DT  2 3 4 3 3 1 4
##      FB  0 0 0 0 0 1 1
##      FS  0 0 0 0 0 0 2
##      G   0 1 6 1 4 2 0
##      K   0 0 0 0 0 0 2
##      LB  5 3 3 5 6 2 7
##      OLB 0 0 0 0 2 1 0
##      OT  5 4 3 1 1 3 4
##      P   0 0 0 0 0 1 0
##      QB  3 2 0 2 2 5 0
##      RB  0 3 5 5 0 4 2
##      S   4 1 2 4 3 1 0
##      SS  0 0 0 0 0 1 2
##      TE  1 3 3 0 1 0 2
##      WR  5 7 3 6 3 5 5
margin.table(mytable)  #generic table with sum...not very useful
## [1] 256
margin.table(mytable, 1)  #sum of positions selected
## position
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR 
##  10  33  22  20   2   2  14   2  31   3  21   1  14  19  15   3  10  34

margin.table(mytable, 2)  #sum of players selected by round
## round
##  1  2  3  4  5  6  7 
## 32 32 36 40 36 39 41

mosaicplot(mytable, main = "Position by Round", xlab = "Position",
    ylab = "Round", cex.axis = 1)  #you can use mosaicplot to graphically represent the data in the table

[Mosaic plot: Position by Round]

What if you want to drill down on a specific team? Here is some simple code to examine one, in this case the Colts:

colts = nfldraft[which(nfldraft$team == "Indianapolis Colts"), ]  #select the rows of data unique to the Colts

colts.table = table(colts$round, colts$position)

margin.table(colts.table, 2)
## 
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR 
##   0   0   1   0   0   0   0   0   1   0   2   0   0   0   0   0   0   1

barplot(colts.table)

[Bar plot: Colts draft picks by position]

There you have it. With some quick modifications to the above, you could produce tables examining players selected by conference, college, and so on.
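For instance, assuming nfldraft is still attached, a couple of hypothetical variations might look like this:

conf.table = table(conference, round)  #picks by conference and round
conf.table
margin.table(conf.table, 1)  #total picks by conference
sort(table(college), decreasing = TRUE)[1:10]  #ten most-drafted colleges
barplot(margin.table(conf.table, 1), las = 2, cex.names = 0.7,
    main = "Draft Picks by Conference")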

Monday, March 17, 2014

MoneyPuck - Best subsets regression of NHL teams

Spring is at hand and it is a time of renewal, March Madness and to settle scores in the NHL.  There are many scores to be settled: Flyers vs. Penguins, Blackhawks vs. Red Wings, Leafs vs. Habs and pretty much everyone else vs. the Bruins.  Like any fire and brimstone hockey fan, clutching my old testament (a well-worn VHS copy of Slapshot), I will be watching with passionate interest as the NHL unleashes a hellish intensity as it comes down the home stretch.  What does it take to win?  I stumbled upon team statistics on nhl.com and decided to have a look with some basic linear regression.  I created a simple .csv file of team stats with 24 variables for each of the 30 teams as of the morning of March 16.

 [1] "Team"              "gp"                "wins"
 [4] "losses"            "ot_losses"         "points"
 [7] "reg_ot"            "home_row"          "road_row"
[10] "point_percent"     "g_game"            "ga_game"
[13] "fiveon5_ratio"     "play_percent"      "pk_percent"
[16] "shots_game"        "sa_game"           "sfirst_percent"
[19] "trfirst_percent"   "lfirst_percent"    "lsecond_percent"
[22] "win_outshoot"      "win_outshot"       "fo_percent"
[25] "pim"

The dependent variable here will be "point_percent", which is the percentage of available points a team has earned per game played.  Since teams have played different numbers of games, and because of the NHL point system (I will not get into that here), it is the best reflection of team performance.  I'll discuss other variables in a minute.
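For reference, the NHL awards two points for a win and one for an overtime or shootout loss, so the points percentage is simply total points divided by twice the games played. A quick sanity check, assuming the .csv has been read into a data frame called nhl.3.16 (the name used in the code below):

> #points percentage = points earned / points available (2 per game played)
> head(round(nhl.3.16$points / (2 * nhl.3.16$gp), 3))
> head(nhl.3.16$point_percent)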

Let's get rid of unneeded variables...

> nhl.sub = nhl.3.16[ ,c(1,10:25)]
> #examine a correlation plot
> require(corrplot)
> nhl.corr = cor(nhl.sub[2:17])
> corrplot(nhl.corr, method="ellipse")

[Correlation plot of the NHL team statistics]

Setting aside goals scored and allowed per game (g_game, ga_game), we see that point_percent is highly correlated with the ratio of goals scored to goals allowed when teams are at full strength (fiveon5_ratio), winning percentage when scoring first (sfirst_percent), winning percentage when outshooting an opponent (win_outshoot), and winning percentage when being outshot (win_outshot).  I was surprised that penalty minutes per game (pim) was not correlated with anything.

> #re-make nhl.sub with no goals or goals allowed per game
> nhl.sub = nhl.3.16[ ,c(10,13:25)]
> require(leaps)
> #use the leaps package for best subsets
> #the package defaults to 8 subsets, but I set to 10 for demonstration purposes
> bestsubsets = regsubsets(point_percent~., nhl.sub, nvmax=10)
> #create an object of regsubsets for further evaluation and model selection
> reg.summary = summary(bestsubsets)
> names(reg.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat"
[8] "obj"

> #plot of residual sum of squares
> plot(reg.summary$rss ,xlab="Number of Variables ",ylab="RSS", type="l")

[Plot: RSS vs. number of variables]

> #looks like somewhere between 4 and 7 variables would be an optimal model...plot Mallows' Cp next
> plot(reg.summary$cp ,xlab="Number of Variables ",ylab="Cp", type="l")

[Plot: Mallows' Cp vs. number of variables]

> #5 variables? let's confirm...
> which.min(reg.summary$cp )
[1] 5

> plot(bestsubsets ,scale="Cp")

[regsubsets plot of models ranked by Cp]

The way to read this plot is to look at the top row (lowest Cp); a black box means the variable on the x-axis is included in that model.  So here we see that the 5-on-5 ratio, power play percentage, penalty kill percentage, percentage of wins when scoring first, and percentage of wins when trailing after the first period make the cut.

> #examine the coefficients
> coef(bestsubsets ,5)
    (Intercept)   fiveon5_ratio    play_percent      pk_percent 
   -0.220482142     0.180024350     0.004574047     0.002724056 
 sfirst_percent trfirst_percent 
    0.303930780     0.277814196 

Here is some additional analysis I did to look at an ANOVA table and examine the residuals (i.e., confirm our linear model assumptions).  I won't bore you with the details:

> nhl.lm = lm(point_percent ~ fiveon5_ratio + play_percent + pk_percent + sfirst_percent + trfirst_percent, data = nhl.sub)
> anova(nhl.lm)
> summary(nhl.lm)
> plot(nhl.lm)

> #create a plot of actual versus predicted with team names for your viewing pleasure
> nhl.3.16["pred"] = NA
> nhl.3.16$pred = predict(nhl.lm)
> require(ggplot2)
> p = ggplot(nhl.3.16, aes(x=pred, y=point_percent, label=Team))
> p + geom_point() + geom_text(size=3, hjust=-.1, vjust=0)

[Scatter plot: predicted vs. actual points percentage, labeled by team]

By the way, the adjusted R-squared was 0.9614.  Interesting how we see the elite teams like St. Louis and Anaheim clustering together and the has-been teams clustering as well.  Teams like Tampa, Washington and Dallas will all be fighting it out for those playoff spots to the glorious and bitter end.

Mahalo,

Cory

Saturday, March 8, 2014

The "Fighting Sioux" Surge

Even a casual fan of North Dakota Hockey will notice that in the era of Coach Dave Hakstol, the team seems to perform better in the second half of the season than the first.  For the rabid fans of the team like myself, it has become a horrible and interesting fact of life.  By mid-December we are inevitably calling for Hakstol's head on a pike only to be whipped into a blood-thirsty frenzy as the team dominates the ice, slaughtering opponents with buzz-saw precision. The WCHA tourney championships and trips to the Frozen Four will attest to that.  Let's not get into the team's Frozen Four issues...just too painful to write about, too many emotional scars.

On February 28th, Dave Berger posted an excellent analysis of the surge on his siouxsports.com blog, "The Second-Half Surge: Math or Myth?"

http://blog.siouxsports.com/2014/02/28/the-second-half-surge-math-or-myth-2/

Using the data Dave provided in his blog, I couldn't help but subject it to some analysis in R.  No fancy econometrics, big data or machine learning here, just some good old fashioned statistical tests on a small data set.  The data starts in the 2004-05 season and goes through last year.  In 7 of the 9 years, the team has a better winning percentage in the second half.

     year w1 l1 t1 g1  win1 w2 l2 t2 g2  win2
1 2004-05 13  7  2 22 0.591 12  8  3 23 0.522
2 2005-06 13  8  1 22 0.591 16  8  0 24 0.667
3 2006-07  9 10  1 20 0.450 15  4  4 23 0.652
4 2007-08  9  7  1 17 0.529 19  4  3 26 0.731
5 2008-09  9 10  1 20 0.450 15  5  3 23 0.652
6 2009-10  9  6  3 18 0.500 16  7  2 25 0.640
7 2010-11 14  5  2 21 0.667 18  3  1 22 0.818
8 2011-12 10  8  2 20 0.500 16  5  1 22 0.727
9 2012-13 10  5  3 18 0.556 12  8  4 24 0.500

The variables are 1st half wins (w1), losses (l1), ties (t1), total games (g1) and win percentage (win1), along with the corresponding data from the 2nd half of each season.
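If you want to run the tests below yourself, one quick option is to key the table in by hand. This is just my reconstruction of Dave's numbers, with win percentage taken as wins divided by total games:

> #first-half and second-half results, 2004-05 through 2012-13
> w1 = c(13, 13, 9, 9, 9, 9, 14, 10, 10)
> l1 = c(7, 8, 10, 7, 10, 6, 5, 8, 5)
> t1 = c(2, 1, 1, 1, 1, 3, 2, 2, 3)
> g1 = w1 + l1 + t1
> win1 = w1 / g1
> w2 = c(12, 16, 15, 19, 15, 16, 18, 16, 12)
> l2 = c(8, 8, 4, 4, 5, 7, 3, 5, 8)
> t2 = c(3, 0, 4, 3, 3, 2, 1, 1, 4)
> g2 = w2 + l2 + t2
> win2 = w2 / g2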

> #boxplots of wins to visualize the difference
> boxplot(win1,win2, main="Second Half Surge", ylab ="Win Percentage", names=c("1st Half", "2nd Half"))

[Box plots: 1st-half vs. 2nd-half win percentage]

> #paired t-test to compare the means; use a paired test because you have two measurements on the same "subject"
> t.test(win1,win2, paired=T)

Paired t-test

data:  win1 and win2
t = -3.1795, df = 8, p-value = 0.01301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.20607414 -0.03281474
sample estimates:
mean of the differences
             -0.1194444

> #p-value is less than .05, so we have achieved statistical significance; note that the average difference in the means is about 12 percentage points

> #proportions test of all seasons
> #create objects of wins and total games
> w1.total = sum(w1)
> g1.total = sum(g1)
> w2.total = sum(w2)
> g2.total = sum(g2)

> #now run the proportions test with vectors, using the objects above
> prop.test(c(w1.total,w2.total), c(g1.total,g2.total))

2-sample test for equality of proportions with continuity
correction

data:  c(w1.total, w2.total) out of c(g1.total, g2.total)
X-squared = 4.9931, df = 1, p-value = 0.02545
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.21872805 -0.01394102
sample estimates:
   prop 1    prop 2
0.5393258 0.6556604

Another significant p-value.  North Dakota is winning almost 54% of their games in the 1st half of a season, but nearly 66% in the second half.  Take out those gut-wrenching Frozen Four losses and the results would be quite impressive.

On a side note, this season has been no different.  I made the trip last October 19th to Oxford, OH in full Sioux regalia only to watch the then-number-1-ranked Miami (OH) mercilessly shred North Dakota.  A couple of weeks ago, we returned the favor, ripping the lungs out of 'em on consecutive nights.  We are in the hunt once again and the pulse quickens as we go in for the kill.  I only hope we don't see Boston College at any point in the near future.

CL






Wednesday, February 5, 2014

An Inconvenient Statistic

As I sit here waiting on more frigid temperatures subsequent to another 10 inches of snow, suffering from metastatic cabin fever, I can't help but ponder what I can do to examine global warming/climate change.  Well, as luck would have it, R has the tools to explore this controversy.  Using two packages, vars and forecast, I will see if I should be purchasing carbon offsets or continue with a life of conspicuous consumption, oblivious to the consequences of my actions.

The concept is to find data on man-made carbon emissions and global surface temperatures, then use vector autoregression to identify the proper number of lags to include in a Granger causality model.  I will not get into any theory here, but you can see a discussion of Granger causality in my very first post, where I showed how to solve the age-old mystery of what comes first, the chicken or the egg (tongue firmly planted in cheek).

It is important to point out that two prior papers have shown no causal linkage between CO2 emissions and surface temperatures (Triacca, 2005 & an unpublished manuscript from Bilancia/Vitale).  In essence, past observations of CO2 concentrations do not improve the statistical predictions of current surface temperatures.  Be that as it may, I will attempt to duplicate such an analysis, giving any adventurous data scientist the tools and techniques to dig into this conundrum on their own.

Where can we find the data?  Global CO2 emission estimates can be found at the Carbon Dioxide Information Analysis Center (CDIAC) at the following website - http://cdiac.ornl.gov/.  You can download data on total emissions from fossil fuel combustion and cement manufacture.  Surface temperature takes some detective work, but a clever soul can find the HadCRUT4 data at the website of the UK Met Office Hadley Centre, produced in collaboration with the Climatic Research Unit at the University of East Anglia - http://www.metoffice.gov.uk/hadobs/hadcrut4/.  An anomaly is calculated as the difference between the average annual surface temperature and the average over the reference period, 1961 - 1990.
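As a toy illustration of the anomaly arithmetic (the numbers here are made up and are not HadCRUT4 values):

> #anomaly = annual temperature minus the mean over the 1961-1990 reference period
> yrs = 1850:2010
> set.seed(11)
> toy.temp = 13.9 + 0.004 * (yrs - 1850) + rnorm(length(yrs), sd = 0.1)
> toy.anomaly = toy.temp - mean(toy.temp[yrs >= 1961 & yrs <= 1990])
> round(tail(toy.anomaly), 3)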

The data have common years from 1850 through 2010; I downloaded them and put them into a .csv for import into R.  Now, it's on to the code!

> require(forecast)
> require(vars)
> var.data = read.csv(file.choose())
> head(var.data)
  Year CO2   Temp
1 1850  54 -0.374
2 1851  54 -0.219
3 1852  57 -0.223
4 1853  59 -0.268
5 1854  69 -0.243
6 1855  71 -0.264

> #put the data into time series objects
> carbon.ts = ts(var.data$CO2, frequency=1, start=c(1850), end=c(2010))
> temp.ts = ts(var.data$Temp, frequency=1, start=c(1850), end=c(2010))
> #subset the data from 1900 until 2010
> surfacetemp = window(temp.ts, start=c(1900), end=c(2010))
> co2 = window(carbon.ts, start=c(1900), end=c(2010))
> climate.ts = cbind(co2, surfacetemp)
> plot(climate.ts)

[Time series plot of co2 and surfacetemp, 1900-2010]

> #determine stationarity and number of lags to achieve stationarity
> ndiffs(co2, alpha = 0.05, test = c("adf"))
[1] 1
> ndiffs(surfacetemp, alpha = 0.05, test = c("adf"))
[1] 1

Using the ADF test in the ndiffs() function of the forecast package, we can see that a first difference will allow us to achieve stationarity, which is necessary for vector autoregression and Granger causality.

> #difference to achieve stationarity
> d.co2 = diff(co2)
> d.temp = diff(surfacetemp)

> #again, we need a mts class dataframe
> climate2.ts = cbind(d.co2, d.temp)
> plot(climate2.ts)

[Time series plot of the differenced CO2 and temperature series]

> #determine the optimal number of lags for vector autoregression
> VARselect(climate2.ts, lag.max=10)$selection

AIC(n)  HQ(n)  SC(n) FPE(n) 
     7      3      1      7 

I find that the above divergence in the tests for optimal VAR modeling is quite common.  Now, one can peruse the literature for what is the best statistical test to determine optimal lag length, but I like to use brute force and ignorance and try all of the above (i.e. lags 1, 3 and 7).

> #vector autoregression with lag1
> var = VAR(climate2.ts, p=1)

It is important now to test for serial correlation in the model residuals; below is the Portmanteau test (several other options are available in the vars package).

> serial.test(var, lags.pt=10, type="PT.asymptotic")

Portmanteau Test (asymptotic)

data:  Residuals of VAR object var
Chi-squared = 55.4989, df = 36, p-value = 0.01996

> #The null hypothesis is no serial correlation, so we can reject it with extreme prejudice...on to var3
> var3 = VAR(climate2.ts, p=3)
> serial.test(var3, lags.pt=10, type="PT.asymptotic")

Portmanteau Test (asymptotic)

data:  Residuals of VAR object var3
Chi-squared = 36.1256, df = 28, p-value = 0.1394

That is more like it.  You can review the details of the VAR model, in this case the temperature equation, if you so choose:

> summary(var3, equation="d.temp")

VAR Estimation Results:
=========================
Endogenous variables: d.co2, d.temp
Deterministic variables: const
Sample size: 107
Log Likelihood: -548.435
Roots of the characteristic polynomial:
0.7812 0.7265 0.7265 0.6491 0.5846 0.5846
Call:
VAR(y = climate2.ts, p = 3)


Estimation results for equation d.temp:
=======================================
d.temp = d.co2.l1 + d.temp.l1 + d.co2.l2 + d.temp.l2 + d.co2.l3 + d.temp.l3 + const

              Estimate Std. Error t value Pr(>|t|)    
d.co2.l1     7.603e-05  1.014e-04   0.749 0.455372    
d.temp.l1   -4.103e-01  9.448e-02  -4.343 3.37e-05 ***
d.co2.l2    -2.152e-05  1.115e-04  -0.193 0.847339    
d.temp.l2   -3.922e-01  9.544e-02  -4.109 8.15e-05 ***
d.co2.l3     7.905e-05  1.041e-04   0.759 0.449465    
d.temp.l3   -3.366e-01  9.263e-02  -3.633 0.000444 ***
const        7.539e-03  1.340e-02   0.563 0.574960    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 0.1014 on 100 degrees of freedom
Multiple R-Squared: 0.254, Adjusted R-squared: 0.2093
F-statistic: 5.676 on 6 and 100 DF,  p-value: 4.15e-05



Covariance matrix of residuals:
           d.co2   d.temp
d.co2  10972.588 -1.28920
d.temp    -1.289  0.01028

Correlation matrix of residuals:
         d.co2  d.temp
d.co2   1.0000 -0.1214
d.temp -0.1214  1.0000

> #does co2 granger cause temperature
> grangertest(d.temp ~ d.co2, order=3)

Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:3) + Lags(d.co2, 1:3)
Model 2: d.temp ~ Lags(d.temp, 1:3)
  Res.Df Df      F Pr(>F)
1    100                 
2    103 -3 0.5064 0.6787

> #Clearly the test is not significant, so we can say that carbon emissions do not Granger-cause surface temperatures.

> #does temperature granger cause co2
> grangertest(d.co2 ~ d.temp, order =3)

Granger causality test

Model 1: d.co2 ~ Lags(d.co2, 1:3) + Lags(d.temp, 1:3)
Model 2: d.co2 ~ Lags(d.co2, 1:3)
  Res.Df Df      F Pr(>F)
1    100                 
2    103 -3 0.7799 0.5079

> #try again using lag 7
> grangertest(d.temp ~ d.co2, order=7)

Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:7) + Lags(d.co2, 1:7)
Model 2: d.temp ~ Lags(d.temp, 1:7)
  Res.Df Df      F Pr(>F)
1     88                 
2     95 -7 0.5817 0.7691

Again, nothing significant using lag 7.  So, using this data and the econometric techniques spelled out above, it seems there is no causal effect (statistically speaking) between fossil fuel emissions and global surface temperatures.  Certainly, this is not the final word on the matter as there is much measurement error in the data that the stewards have attempted to account for.
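As a cross-check, the vars package also provides a causality() function that runs Wald-type Granger (and instantaneous) causality tests directly on a fitted VAR object. A quick sketch using the var3 model from above:

> #Granger causality tests on the fitted VAR; the cause argument must match
> #the column names in climate2.ts
> causality(var3, cause = "d.co2")$Granger
> causality(var3, cause = "d.temp")$Granger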

On a side note, we can use vars for predictions and forecast for time series plots of the predicted values.

> predict(var3, n.ahead=6, ci=0.95)

$d.co2
         fcst       lower    upper       CI
[1,] 202.5888   -2.717626 407.8953 205.3065
[2,] 110.3385 -105.847948 326.5249 216.1864
[3,] 192.1802  -26.160397 410.5207 218.3406
[4,] 152.5464  -74.948000 380.0408 227.4944
[5,] 108.4343 -122.198058 339.0666 230.6323
[6,] 123.9001 -107.882219 355.6824 231.7823

$d.temp
             fcst      lower     upper        CI
[1,]  0.026737000 -0.1719770 0.2254510 0.1987140
[2,] -0.057081637 -0.2731569 0.1589936 0.2160753
[3,]  0.040419451 -0.1803409 0.2611798 0.2207603
[4,]  0.032591047 -0.1893108 0.2544929 0.2219019
[5,]  0.013708836 -0.2143756 0.2417933 0.2280844
[6,] -0.004319714 -0.2324070 0.2237675 0.2280873

> fcst = forecast(var3)
> plot(fcst)

[Forecast plots for d.co2 and d.temp]

So what can we conclude from this exercise?  Well, let's look to the good Doctor, Hunter S. Thompson for some philosophical insight.  He would likely advise us...

"res ipsa locquitur"

References:

Bilancia, Massimo, and Domenico Vitale. "Granger causality analysis of bivariate climatic time series: A note on the role of CO2 emissions in global climate warming." Unpublished manuscript.

Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones (2012), Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset, J. Geophys. Res., 117, D08101, doi:10.1029/2011JD017187.

Triacca, U. (2005). Is Granger causality analysis appropriate to investigate the relationship between atmospheric concentration of carbon dioxide and global surface air temperature? Theoretical and Applied Climatology, 81, 133-135.