# Fear and Loathing in Data Science

A Savage Journey to the Heart of Big Data

## Saturday, June 28, 2014

I've almost achieved my goal of a minimum of 100 quality responses. I will keep the survey open for a couple more days, so if you haven't responded, please feel free to have a crack at it.

Regards,

Cory

https://www.surveymonkey.com/s/7DY8DYH

## Sunday, June 22, 2014

### Survey - Voice of the Reader

Dear Blog Readers,

I have numerous topics I could discuss in future blog posts. However, I would like to capture the topics that most interest you. Therefore, I've put together a short survey (see the link below) for everyone to weigh in on the subject.

I will keep this open for a couple of weeks and once closed, I will publish the results and make the data public.

Thank you in advance,

Cory

https://www.surveymonkey.com/s/7DY8DYH


## Wednesday, May 14, 2014

### The NFL Reloads: Using R to Have a Look


It can't be easy being a Jacksonville Jaguars fan. It seems that management has learned nothing from the Blaine Gabbert debacle. By that I mean I'm not impressed with their first-round pick, Blake Bortles. Bucky Brooks of NFL.com called him a "developmental prospect". Developmental? Is that what you want from the third overall pick? I hate to say it, but I have to agree with Skip Bayless that Bortles is the new Gabbert in J-Ville. We shall see soon enough if I will eat these words!

At any rate, I've downloaded draft data from the web so you can explore the draft at your own leisure. I scraped it off of Wikipedia at this link: http://en.wikipedia.org/wiki/2014_NFL_Draft

I put together some code to get you on your way to exploring:

```
attach(nfldraft)
str(nfldraft)
## 'data.frame': 256 obs. of 8 variables:
## $ round     : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pick      : int 1 2 3 4 5 6 7 8 9 10 ...
## $ team      : Factor w/ 32 levels "Arizona Cardinals",..: 13 29 15 4 23 2 30 8 18 11 ...
## $ player    : Factor w/ 256 levels "Aaron Colvin ",..: 96 89 25 204 147 97 178 131 12 84 ...
## $ position  : Factor w/ 18 levels "C","CB","DE",..: 3 11 13 18 9 11 18 2 9 17 ...
## $ college   : Factor w/ 113 levels "Alabama","Arizona",..: 86 6 99 15 12 95 95 71 100 62 ...
## $ conference: Factor w/ 24 levels "ACC","Big 12",..: 21 21 24 1 11 21 21 2 17 1 ...
## $ notes     : Factor w/ 88 levels "","from Arizona[R1 - 5]",..: 1 86 1 20 1 1 1 51 9 1 ...

names(nfldraft)
## [1] "round"      "pick"       "team"       "player"     "position"
## [6] "college"    "conference" "notes"

mytable = table(position, round) #create a table of player position and round selected
mytable
##         round
## position  1  2  3  4  5  6  7
##      C    0  1  2  2  2  2  1
##      CB   5  1  2  9  4  7  5
##      DE   2  3  3  2  5  3  4
##      DT   2  3  4  3  3  1  4
##      FB   0  0  0  0  0  1  1
##      FS   0  0  0  0  0  0  2
##      G    0  1  6  1  4  2  0
##      K    0  0  0  0  0  0  2
##      LB   5  3  3  5  6  2  7
##      OLB  0  0  0  0  2  1  0
##      OT   5  4  3  1  1  3  4
##      P    0  0  0  0  0  1  0
##      QB   3  2  0  2  2  5  0
##      RB   0  3  5  5  0  4  2
##      S    4  1  2  4  3  1  0
##      SS   0  0  0  0  0  1  2
##      TE   1  3  3  0  1  0  2
##      WR   5  7  3  6  3  5  5

margin.table(mytable) #generic table with sum...not very useful
## [1] 256

margin.table(mytable, 1) #sum of positions selected
## position
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR
##  10  33  22  20   2   2  14   2  31   3  21   1  14  19  15   3  10  34

margin.table(mytable, 2) #sum of players selected by round
## round
##  1  2  3  4  5  6  7
## 32 32 36 40 36 39 41

mosaicplot(mytable, main="Position by Round", xlab="Position", ylab="Round",
           cex.axis=1) #you can use mosaicplot to graphically represent the data in the table
```

What if you want to drill down on a specific team or the like? Here is some simple code to examine a specific team, in this case the Colts:

```
colts = nfldraft[which(nfldraft$team == "Indianapolis Colts"), ] #select the rows of data unique to the Colts
colts.table = table(colts$round, colts$position)
margin.table(colts.table, 2)
##
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR
##   0   0   1   0   0   0   0   0   1   0   2   0   0   0   0   0   0   1

barplot(colts.table)
```

There you have it. With some quick modifications to the above, you could produce tables examining players selected by conference, college, etc.
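If you want a head start on the conference angle, here is a minimal sketch of the same `table` idea applied to the conference column. The data frame below is a tiny hypothetical stand-in for the real `nfldraft` (which has 256 rows); swap in the scraped data to get the real counts.

```
# Hypothetical five-row stand-in for the scraped nfldraft data frame
nfldraft <- data.frame(
  round      = c(1, 1, 2, 2, 3),
  position   = c("DE", "OT", "QB", "WR", "CB"),
  conference = c("SEC", "ACC", "SEC", "Pac-12", "Big Ten")
)

# same idea as the position-by-round table, but by conference
conf.table <- table(nfldraft$conference, nfldraft$round)
conf.table
margin.table(conf.table, 1) # total players drafted per conference
```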

## Thursday, April 24, 2014

### Analysis of Copper and Equity Prices on the Quandl.com blog

http://blog.quandl.com/blog/mythbusting-dr-copper/

## Monday, March 17, 2014

### MoneyPuck - Best subsets regression of NHL teams

Spring is at hand and it is a time of renewal, March Madness and to settle scores in the NHL. There are many scores to be settled: Flyers vs. Penguins, Blackhawks vs. Red Wings, Leafs vs. Habs and pretty much everyone else vs. the Bruins. Like any fire and brimstone hockey fan, clutching my old testament (a well-worn VHS copy of Slapshot), I will be watching with passionate interest as the NHL unleashes a hellish intensity as it comes down the home stretch. What does it take to win? I stumbled upon team statistics on nhl.com and decided to have a look with some basic linear regression. I created a simple .csv file of team stats with 24 variables for each of the 30 teams as of the morning of March 16.

```
 [1] "Team"            "gp"              "wins"
 [4] "losses"          "ot_losses"       "points"
 [7] "reg_ot"          "home_row"        "road_row"
[10] "point_percent"   "g_game"          "ga_game"
[13] "fiveon5_ratio"   "play_percent"    "pk_percent"
[16] "shots_game"      "sa_game"         "sfirst_percent"
[19] "trfirst_percent" "lfirst_percent"  "lsecond_percent"
[22] "win_outshoot"    "win_outshot"     "fo_percent"
[25] "pim"
```

The dependent variable here will be "point_percent", the percentage of available points a team has earned per game played. Since teams have played different numbers of games, and because of the NHL point system (I will not get into that here), it is the best reflection of team performance. I'll discuss other variables in a minute.
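To make that concrete, here is a quick sketch of how a point percentage is computed, assuming the standard NHL scheme of two points for a win and one for an overtime/shootout loss (the record below is invented for illustration):

```
# hypothetical record: 68 games, 44 wins, 5 overtime losses
gp <- 68; wins <- 44; ot_losses <- 5
points <- 2 * wins + 1 * ot_losses  # league points earned
point_percent <- points / (2 * gp)  # share of all available points
round(point_percent, 3)
```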

Let's get rid of unneeded variables...

```
> nhl.sub = nhl.3.16[ ,c(1,10:25)]
> #examine a correlation plot
> require(corrplot)
> nhl.corr = cor(nhl.sub[2:17])
> corrplot(nhl.corr, method="ellipse")
```

Setting aside goals scored and allowed per game (g_game, ga_game), we see that we have high correlation with the ratio of goals scored versus opponents when teams are at full strength (fiveon5_ratio), winning percentage when scoring first (sfirst_percent), winning when outshooting an opponent (win_outshoot) and when outshot (win_outshot). I was surprised that penalty minutes per game (pim) was not correlated with anything.

```
> #re-make nhl.sub with no goals or goals allowed per game
> nhl.sub = nhl.3.16[ ,c(10,13:25)]
> require(leaps)
> #use the leaps package for best subsets
> #the package defaults to 8 subsets, but I set it to 10 for demonstration purposes
> bestsubsets = regsubsets(point_percent~., nhl.sub, nvmax=10)
> #create an object of regsubsets for further evaluation and model selection
> reg.summary = summary(bestsubsets)
> names(reg.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat"
[8] "obj"
```

```
> #plot of residual sum of squares
> plot(reg.summary$rss, xlab="Number of Variables", ylab="RSS", type="l")
> #looks like somewhere between 4 and 7 variables would be an optimal model...plot Mallows' Cp next
> plot(reg.summary$cp, xlab="Number of Variables", ylab="Cp", type="l")
> #5 variables? let's confirm...
> which.min(reg.summary$cp)
[1] 5
> plot(bestsubsets, scale="Cp")
```

The way to read this plot is to look at the top row (lowest Cp); a black box corresponds to a variable on the x-axis. Here we see that the 5-on-5 ratio, power-play percentage, penalty-kill percentage, percentage of wins when scoring first, and percentage of wins when trailing after the first period make the cut.

```
> #examine the coefficients
> coef(bestsubsets, 5)
   (Intercept)  fiveon5_ratio   play_percent     pk_percent
  -0.220482142    0.180024350    0.004574047    0.002724056
sfirst_percent trfirst_percent
   0.303930780    0.277814196
```

Here is some additional analysis I did to look at an ANOVA table and examine the residuals, i.e., to confirm our linear-model assumptions; I won't bore you with the details:

```
> nhl.lm = lm(point_percent ~ fiveon5_ratio + play_percent + pk_percent + sfirst_percent + trfirst_percent)
> anova(nhl.lm)
> summary(nhl.lm)
> plot(nhl.lm)
> #create a plot of actual versus predicted with team names for your viewing pleasure
> nhl.3.16["pred"] = NA
> nhl.3.16$pred = predict(nhl.lm)
> require(ggplot2)
> p = ggplot(nhl.3.16, aes(x=pred, y=point_percent, label=Team))
> p + geom_point() + geom_text(size=3, hjust=-.1, vjust=0)
```

By the way, the adjusted R-squared was 0.9614. Interesting how we see the elite teams like St. Louis and Anaheim clustering together and the has-been teams clustering as well. Teams like Tampa, Washington and Dallas will all be fighting it out for those playoff spots to the glorious and bitter end.

Mahalo,

Cory


## Saturday, March 8, 2014

### The "Fighting Sioux" Surge

Even a casual fan of North Dakota Hockey will notice that in the era of Coach Dave Hakstol, the team seems to perform better in the second half of the season than the first. For the rabid fans of the team like myself, it has become a horrible and interesting fact of life. By mid-December we are inevitably calling for Hakstol's head on a pike only to be whipped into a blood-thirsty frenzy as the team dominates the ice, slaughtering opponents with buzz-saw precision. The WCHA tourney championships and trips to the Frozen Four will attest to that. Let's not get into the team's Frozen Four issues...just too painful to write about, too many emotional scars.

On February 28th, Dave Berger posted an excellent analysis of the surge on his siouxsports.com blog, "The Second-Half Surge: Math or Myth?"

http://blog.siouxsports.com/2014/02/28/the-second-half-surge-math-or-myth-2/

Using the data Dave provided in his blog, I couldn't help but subject it to some analysis in R. No fancy econometrics, big data or machine learning here, just some good old fashioned statistical tests on a small data set. The data starts in the 2004-05 season and goes through last year. In 7 of the 9 years, the team has a better winning percentage in the second half.

```
     year w1 l1 t1 g1  win1 w2 l2 t2 g2  win2
1 2004-05 13  7  2 22 0.591 12  8  3 23 0.522
2 2005-06 13  8  1 22 0.591 16  8  0 24 0.667
3 2006-07  9 10  1 20 0.450 15  4  4 23 0.652
4 2007-08  9  7  1 17 0.529 19  4  3 26 0.731
5 2008-09  9 10  1 20 0.450 15  5  3 23 0.652
6 2009-10  9  6  3 18 0.500 16  7  2 25 0.640
7 2010-11 14  5  2 21 0.667 18  3  1 22 0.818
8 2011-12 10  8  2 20 0.500 16  5  1 22 0.727
9 2012-13 10  5  3 18 0.556 12  8  4 24 0.500
```

The variables are 1st half wins (w1), losses (l1), ties (t1), total games (g1) and win percentage (win1), along with the corresponding data from the 2nd half of each season.
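The two win-percentage columns can be rebuilt directly from the counts above (win percentage here is simply wins over games played, with ties not weighted), which also lets us verify the 7-of-9 claim:

```
# raw first- and second-half wins and games, 2004-05 through 2012-13
w1 <- c(13, 13, 9, 9, 9, 9, 14, 10, 10)
g1 <- c(22, 22, 20, 17, 20, 18, 21, 20, 18)
w2 <- c(12, 16, 15, 19, 15, 16, 18, 16, 12)
g2 <- c(23, 24, 23, 26, 23, 25, 22, 22, 24)
win1 <- round(w1 / g1, 3)
win2 <- round(w2 / g2, 3)
sum(win2 > win1) # seasons with a better second half: 7
```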

```
> #boxplots of wins to visualize the difference
> boxplot(win1, win2, main="Second Half Surge", ylab="Win Percentage", names=c("1st Half", "2nd Half"))
> #paired t-test to compare the means; use a paired test because you have two measurements on the same "subject"
> t.test(win1, win2, paired=T)

        Paired t-test

data:  win1 and win2
t = -3.1795, df = 8, p-value = 0.01301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.20607414 -0.03281474
sample estimates:
mean of the differences
             -0.1194444
```

```
> #p-value is less than .05, so we have achieved statistical significance; note the average difference in the means is about 12%
> #proportions test of all seasons
> #create objects of wins and total games
> w1.total = sum(w1)
> g1.total = sum(g1)
> w2.total = sum(w2)
> g2.total = sum(g2)
> #now run the proportions test with vectors, using the objects above
> prop.test(c(w1.total, w2.total), c(g1.total, g2.total))

        2-sample test for equality of proportions with continuity correction

data:  c(w1.total, w2.total) out of c(g1.total, g2.total)
X-squared = 4.9931, df = 1, p-value = 0.02545
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.21872805 -0.01394102
sample estimates:
   prop 1    prop 2
0.5393258 0.6556604
```

Another significant p-value. North Dakota is winning almost 54% of their games in the 1st half of a season, but nearly 66% in the second half. Take out those gut-wrenching Frozen Four losses and the results would be quite impressive.
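Those two proportions check out by hand from the season totals:

```
# season totals from the table above, 2004-05 through 2012-13
w1.total <- sum(c(13, 13, 9, 9, 9, 9, 14, 10, 10))    # 96 first-half wins
g1.total <- sum(c(22, 22, 20, 17, 20, 18, 21, 20, 18)) # 178 first-half games
w2.total <- sum(c(12, 16, 15, 19, 15, 16, 18, 16, 12)) # 139 second-half wins
g2.total <- sum(c(23, 24, 23, 26, 23, 25, 22, 22, 24)) # 212 second-half games
round(w1.total / g1.total, 7) # 0.5393258
round(w2.total / g2.total, 7) # 0.6556604
```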

On a side note, this season has been no different. I made the trip last October 19th to Oxford, OH in full Sioux regalia only to watch then number 1 ranked Miami (OH) mercilessly shred North Dakota. A couple of weeks ago, we returned the favor, ripping the lungs out of 'em on consecutive nights. We are in the hunt once again and the pulse quickens as we go in for the kill. I only hope we don't see Boston College at any point in the near future.

CL


## Wednesday, February 5, 2014

### An Inconvenient Statistic

As I sit here waiting on more frigid temperatures subsequent to another 10 inches of snow, suffering from metastatic cabin fever, I can't help but ponder what I can do to examine global warming/climate change. Well, as luck would have it, R has the tools to explore this controversy. Using two packages, vars and forecast, I will see if I should be purchasing carbon offsets or continuing with a life of conspicuous consumption, oblivious to the consequences of my actions.

The concept is to find data on man-made carbon emissions and global surface temperatures, then use vector autoregression to identify the proper number of lags to put into a Granger causality model. I will not get into any theory here, but you can see a discussion of Granger causality in my very first post, where I showed how to solve the age-old mystery of what comes first, the chicken or the egg (tongue firmly planted in cheek).

It is important to point out that two prior papers have shown no causal linkage between CO2 emissions and surface temperatures (Triacca, 2005 & an unpublished manuscript from Bilancia/Vitale). In essence, past observations of CO2 concentrations do not improve the statistical predictions of current surface temperatures. Be that as it may, I will attempt to duplicate such an analysis, giving any adventurous data scientist the tools and techniques to dig into this conundrum on their own.
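For readers new to the idea, Granger's test boils down to comparing two regressions: one where a series is explained only by its own past, and one that also includes the other series' past. Here is a base-R sketch on simulated data (the series and numbers are invented for illustration, not the climate data):

```
set.seed(7)
x <- rnorm(120)                                      # the "cause" series
y <- 0.8 * c(0, head(x, -1)) + rnorm(120, sd = 0.3)  # y is driven by lagged x
n <- length(y)
y0 <- y[3:n]        # current y
y1 <- y[2:(n - 1)]  # y lagged once
x1 <- x[2:(n - 1)]  # x lagged once
restricted   <- lm(y0 ~ y1)       # y explained by its own past only
unrestricted <- lm(y0 ~ y1 + x1)  # ...plus lagged x
anova(restricted, unrestricted)   # a small p-value says lagged x helps: "x Granger-causes y"
```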

Where can we find the data? Global CO2 emission estimates can be found at the Carbon Dioxide Information Analysis Center (CDIAC): http://cdiac.ornl.gov/. You can download data on total emissions from fossil fuel combustion and cement manufacture. Surface temperature takes some detective work, but a clever soul can find it at the website of the UK Met Office Hadley Centre, which produces the data set jointly with the climate research unit at the University of East Anglia: http://www.metoffice.gov.uk/hadobs/hadcrut4/. An anomaly is calculated as the difference between the average annual surface temperature and the average over the reference years, 1961 - 1990.
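To be explicit about the anomaly definition, here is a toy computation (the temperatures are invented, and HadCRUT4 uses the full 1961-1990 reference period, not three years):

```
# invented annual mean temperatures, degrees C
temps <- c("1961" = 14.0, "1975" = 14.1, "1990" = 14.3, "2010" = 14.6)
ref_mean <- mean(temps[c("1961", "1975", "1990")]) # stand-in for the 1961-1990 average
anomaly_2010 <- unname(temps["2010"] - ref_mean)   # departure from the reference mean
round(anomaly_2010, 3) # 0.467
```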

The data have common years from 1850 until 2010 and I downloaded and put it into a .csv for import into R. Now, it's on to the code!

```
> require(forecast)
> require(vars)
> var.data = read.csv(file.choose())
> head(var.data)
  Year CO2   Temp
1 1850  54 -0.374
2 1851  54 -0.219
3 1852  57 -0.223
4 1853  59 -0.268
5 1854  69 -0.243
6 1855  71 -0.264
```

```
> #put data into a time series
> carbon.ts = ts(CO2, frequency=1, start=c(1850), end=c(2010))
> temp.ts = ts(Temp, frequency=1, start=c(1850), end=c(2010))
> #subset the data from 1900 until 2010
> surfacetemp = window(temp.ts, start=c(1900), end=c(2010))
> co2 = window(carbon.ts, start=c(1900), end=c(2010))
> climate.ts = cbind(co2, surfacetemp)
> plot(climate.ts)
> #determine stationarity and number of lags to achieve stationarity
> ndiffs(co2, alpha = 0.05, test = c("adf"))
[1] 1
> ndiffs(surfacetemp, alpha = 0.05, test = c("adf"))
[1] 1
```

Using the ADF test in the ndiffs command of the forecast package, we can see that a first difference will allow us to achieve stationarity, which is necessary for vector autoregression and Granger causality.
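If differencing is new to you, this little simulation shows why it works: a series with a linear trend is non-stationary, but its first differences fluctuate around a constant (the slope). The series here is simulated, not the climate data:

```
set.seed(1)
trend_series <- 1:50 + rnorm(50) # upward trend: the mean changes over time
d_series <- diff(trend_series)   # first differences hover around the slope (1)
round(mean(d_series), 2)
```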

```
> #difference to achieve stationarity
> d.co2 = diff(co2)
> d.temp = diff(surfacetemp)
> #again, we need a mts class dataframe
> climate2.ts = cbind(d.co2, d.temp)
> plot(climate2.ts)
> #determine the optimal number of lags for vector autoregression
> VARselect(climate2.ts, lag.max=10)$selection
AIC(n)  HQ(n)  SC(n) FPE(n)
     7      3      1      7
```

I find that the above divergence in the tests for optimal VAR modeling is quite common. Now, one can peruse the literature for what is the best statistical test to determine optimal lag length, but I like to use brute force and ignorance and try all of the above (i.e. lags 1, 3 and 7).

```
> #vector autoregression with lag 1
> var = VAR(climate2.ts, p=1)
```

It is important now to test for serial autocorrelation in the model residuals; below is the Portmanteau test (several options are available in the vars package).
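To get a feel for what such a test reacts to, here is a quick base-R analogue using `Box.test` (a univariate portmanteau test) on a simulated random walk, which is about as serially correlated as a series can get; the simulation is mine, not part of the original analysis:

```
set.seed(42)
rw <- cumsum(rnorm(200))                   # random walk: heavy serial correlation
Box.test(rw, lag = 10, type = "Ljung-Box") # tiny p-value: reject "no serial correlation"
```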

```
> serial.test(var, lags.pt=10, type="PT.asymptotic")

        Portmanteau Test (asymptotic)

data:  Residuals of VAR object var
Chi-squared = 55.4989, df = 36, p-value = 0.01996

> #The null hypothesis is no serial correlation, so we can reject it with extreme prejudice...on to var3
> var3 = VAR(climate2.ts, p=3)
> serial.test(var3, lags.pt=10, type="PT.asymptotic")

        Portmanteau Test (asymptotic)

data:  Residuals of VAR object var3
Chi-squared = 36.1256, df = 28, p-value = 0.1394
```

That is more like it. You can review the details of the VAR model, in this case the temperature equation, if you so choose:

```
> summary(var3, equation="d.temp")

VAR Estimation Results:
=========================
Endogenous variables: d.co2, d.temp
Deterministic variables: const
Sample size: 107
Log Likelihood: -548.435
Roots of the characteristic polynomial:
0.7812 0.7265 0.7265 0.6491 0.5846 0.5846
Call:
VAR(y = climate2.ts, p = 3)

Estimation results for equation d.temp:
=======================================
d.temp = d.co2.l1 + d.temp.l1 + d.co2.l2 + d.temp.l2 + d.co2.l3 + d.temp.l3 + const

            Estimate Std. Error t value Pr(>|t|)
d.co2.l1   7.603e-05  1.014e-04   0.749 0.455372
d.temp.l1 -4.103e-01  9.448e-02  -4.343 3.37e-05 ***
d.co2.l2  -2.152e-05  1.115e-04  -0.193 0.847339
d.temp.l2 -3.922e-01  9.544e-02  -4.109 8.15e-05 ***
d.co2.l3   7.905e-05  1.041e-04   0.759 0.449465
d.temp.l3 -3.366e-01  9.263e-02  -3.633 0.000444 ***
const      7.539e-03  1.340e-02   0.563 0.574960
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1014 on 100 degrees of freedom
Multiple R-Squared: 0.254, Adjusted R-squared: 0.2093
F-statistic: 5.676 on 6 and 100 DF, p-value: 4.15e-05

Covariance matrix of residuals:
           d.co2   d.temp
d.co2  10972.588 -1.28920
d.temp    -1.289  0.01028

Correlation matrix of residuals:
        d.co2  d.temp
d.co2  1.0000 -0.1214
d.temp -0.1214  1.0000
```

```
> #does co2 granger-cause temperature?
> grangertest(d.temp ~ d.co2, order=3)
Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:3) + Lags(d.co2, 1:3)
Model 2: d.temp ~ Lags(d.temp, 1:3)
  Res.Df Df      F Pr(>F)
1    100
2    103 -3 0.5064 0.6787
```

```
> #Clearly the model is not significant, so we can say that carbon emissions do not granger-cause surface temperatures.
> #does temperature granger-cause co2?
> grangertest(d.co2 ~ d.temp, order=3)
Granger causality test

Model 1: d.co2 ~ Lags(d.co2, 1:3) + Lags(d.temp, 1:3)
Model 2: d.co2 ~ Lags(d.co2, 1:3)
  Res.Df Df      F Pr(>F)
1    100
2    103 -3 0.7799 0.5079
```

```
> #try again using lag 7
> grangertest(d.temp ~ d.co2, order=7)
Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:7) + Lags(d.co2, 1:7)
Model 2: d.temp ~ Lags(d.temp, 1:7)
  Res.Df Df      F Pr(>F)
1     88
2     95 -7 0.5817 0.7691
```

Again, nothing significant using lag 7. So, using this data and the econometric techniques spelled out above, it seems there is no causal effect (statistically speaking) between fossil fuel emissions and global surface temperatures. Certainly, this is not the final word on the matter as there is much measurement error in the data that the stewards have attempted to account for.

On a side note, we can use vars for predictions and forecast for time series plots of the predicted values.

```
> predict(var3, n.ahead=6, ci=0.95)
$d.co2
         fcst       lower    upper       CI
[1,] 202.5888   -2.717626 407.8953 205.3065
[2,] 110.3385 -105.847948 326.5249 216.1864
[3,] 192.1802  -26.160397 410.5207 218.3406
[4,] 152.5464  -74.948000 380.0408 227.4944
[5,] 108.4343 -122.198058 339.0666 230.6323
[6,] 123.9001 -107.882219 355.6824 231.7823

$d.temp
             fcst      lower     upper        CI
[1,]  0.026737000 -0.1719770 0.2254510 0.1987140
[2,] -0.057081637 -0.2731569 0.1589936 0.2160753
[3,]  0.040419451 -0.1803409 0.2611798 0.2207603
[4,]  0.032591047 -0.1893108 0.2544929 0.2219019
[5,]  0.013708836 -0.2143756 0.2417933 0.2280844
[6,] -0.004319714 -0.2324070 0.2237675 0.2280873

> fcst = forecast(var3)
> plot(fcst)
```

So what can we conclude from this exercise? Well, let's look to the good Doctor, Hunter S. Thompson for some philosophical insight. He would likely advise us...

"res ipsa loquitur"

References:

Bilancia, Massimo, and Domenico Vitale, "Granger causality analysis of bivariate climatic time series: a note on the role of CO2 emissions in global climate warming," unpublished manuscript.

Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones (2012), Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset, J. Geophys. Res., 117, D08101, doi:10.1029/2011JD017187.

Triacca, U. (2005), Is Granger causality analysis appropriate to investigate the relationship between atmospheric concentration of carbon dioxide and global surface air temperature?, Theoretical and Applied Climatology, 81, 133-135.

The concept is to find data on man-made carbon emissions and global surface temperatures. Then, using vector autoregression to identify the proper number of lags to put into a granger causality model. I will not get into any theory here, but you can see a discussion of granger causality in my very first post where I showed how to solve the age-old mystery of what comes first, the chicken or the egg (tongue firmly planted in cheek).

It is important to point out that two prior papers have shown no causal linkage between CO2 emissions and surface temperatures (Triacca, 2005 & an unpublished manuscript from Bilancia/Vitale). In essence, past observations of CO2 concentrations do not improve the statistical predictions of current surface temperatures. Be that as it may, I will attempt to duplicate such an analysis, giving any adventurous data scientist the tools and techniques to dig into this conundrum on their own.

Where can we find the data? Global CO emission estimates can be found at the Carbon Dioxide Information Analysis Center (CDIAC) at the following website - http://cdiac.ornl.gov/. You can download data of total emissions of fossil fuel combustion and cement manufacture. Surface temperature takes some detective work, but a clever soul can find it at the website of the UK Met Office Hadley Centre, part of the climate research center at the University of East Anglia, website - http://www.metoffice.gov.uk/hadobs/hadcrut4/ . An anomaly is calculated as the difference between the average annual surface temperature versus the average of the reference years, 1961 - 1990.

The data have common years from 1850 until 2010 and I downloaded and put it into a .csv for import into R. Now, it's on to the code!

> require(forecast)

> require(vars)

> var.data = read.csv(file.choose())

> head(var.data)

Year CO2 Temp

1 1850 54 -0.374

2 1851 54 -0.219

3 1852 57 -0.223

4 1853 59 -0.268

5 1854 69 -0.243

6 1855 71 -0.264

> #put data into a time series

> carbon.ts = ts(CO2, frequency=1, start=c(1850), end=c(2010))

> temp.ts = ts(Temp, frequency=1, start=c(1850), end=c(2010))

#subset the data from 1900 until 2010

> surfacetemp = window(temp.ts, start=c(1900), end=c(2010))

> co2 = window(carbon.ts, start=c(1900), end=c(2010))

> climate.ts = cbind(co2, surfacetemp)

> plot(climate.ts)

> #determine stationarity and number of lags to achieve stationarity

> ndiffs(co2, alpha = 0.05, test = c("adf"))

[1] 1

> ndiffs(surfacetemp, alpha = 0.05, test = c("adf"))

[1] 1

Using the adf test above in the ndiffs command of the forecast package, we can see that a 1st difference will allow us to achieve stationarity, which is necessary for vector autoregression and granger causality.

> #difference to achieve stationarity

> d.co2 = diff(co2)

> d.temp = diff(surfacetemp)

> #again, we need a mts class dataframe

> climate2.ts = cbind(d.co2, d.temp)

> plot(climate2.ts)

> #determine the optimal number of lags for vector autoregression

> VARselect(climate2.ts, lag.max=10) $selection

AIC(n) HQ(n) SC(n) FPE(n)

7 3 1 7

I find that the above divergence in the tests for optimal VAR modeling is quite common. Now, one can peruse the literature for what is the best statistical test to determine optimal lag length, but I like to use brute force and ignorance and try all of the above (i.e. lags 1, 3 and 7).

> #vector autoregression with lag 1 (named var1 to avoid masking the base function var)

> var1 = VAR(climate2.ts, p=1)

It is now important to test for serial autocorrelation in the model residuals; below is the Portmanteau test (the vars package offers several other options).

> serial.test(var1, lags.pt=10, type="PT.asymptotic")

Portmanteau Test (asymptotic)

data: Residuals of VAR object var1

Chi-squared = 55.4989, df = 36, p-value = 0.01996

> #The null hypothesis is no serial correlation, so at p = 0.02 we reject it with extreme prejudice; the lag-1 model won't do...on to var3

> var3 = VAR(climate2.ts, p=3)

> serial.test(var3, lags.pt=10, type="PT.asymptotic")

Portmanteau Test (asymptotic)

data: Residuals of VAR object var3

Chi-squared = 36.1256, df = 28, p-value = 0.1394

That is more like it. You can review the details of the var model, in this case temperature, if you so choose:

> summary(var3, equation="d.temp")

=========================

Endogenous variables: d.co2, d.temp

Deterministic variables: const

Sample size: 107

Log Likelihood: -548.435

Roots of the characteristic polynomial:

0.7812 0.7265 0.7265 0.6491 0.5846 0.5846

Call:

VAR(y = climate2.ts, p = 3)

Estimation results for equation d.temp:

=======================================

d.temp = d.co2.l1 + d.temp.l1 + d.co2.l2 + d.temp.l2 + d.co2.l3 + d.temp.l3 + const

Estimate Std. Error t value Pr(>|t|)

d.co2.l1 7.603e-05 1.014e-04 0.749 0.455372

d.temp.l1 -4.103e-01 9.448e-02 -4.343 3.37e-05 ***

d.co2.l2 -2.152e-05 1.115e-04 -0.193 0.847339

d.temp.l2 -3.922e-01 9.544e-02 -4.109 8.15e-05 ***

d.co2.l3 7.905e-05 1.041e-04 0.759 0.449465

d.temp.l3 -3.366e-01 9.263e-02 -3.633 0.000444 ***

const 7.539e-03 1.340e-02 0.563 0.574960

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1014 on 100 degrees of freedom

Multiple R-Squared: 0.254, Adjusted R-squared: 0.2093

F-statistic: 5.676 on 6 and 100 DF, p-value: 4.15e-05

Covariance matrix of residuals:

d.co2 d.temp

d.co2 10972.588 -1.28920

d.temp -1.289 0.01028

Correlation matrix of residuals:

d.co2 d.temp

d.co2 1.0000 -0.1214

d.temp -0.1214 1.0000

> #does co2 Granger-cause temperature? (grangertest is from the lmtest package, loaded alongside vars)

> grangertest(d.temp ~ d.co2, order=3)

Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:3) + Lags(d.co2, 1:3)

Model 2: d.temp ~ Lags(d.temp, 1:3)

Res.Df Df F Pr(>F)

1 100

2 103 -3 0.5064 0.6787

> #The test is far from significant (p = 0.68), so we find no evidence that carbon emissions Granger-cause surface temperatures.

> #does temperature Granger-cause co2?

> grangertest(d.co2 ~ d.temp, order =3)

Granger causality test

Model 1: d.co2 ~ Lags(d.co2, 1:3) + Lags(d.temp, 1:3)

Model 2: d.co2 ~ Lags(d.co2, 1:3)

Res.Df Df F Pr(>F)

1 100

2 103 -3 0.7799 0.5079

> #try again using lag 7

> grangertest(d.temp ~ d.co2, order=7)

Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:7) + Lags(d.co2, 1:7)

Model 2: d.temp ~ Lags(d.temp, 1:7)

Res.Df Df F Pr(>F)

1 88

2 95 -7 0.5817 0.7691

Again, nothing significant at lag 7. So, using these data and the econometric techniques spelled out above, we find no statistical evidence of Granger causality between fossil fuel emissions and global surface temperatures. This is certainly not the final word on the matter, as the data contain substantial measurement error that their stewards have attempted to account for.
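As a cross-check on the grangertest results, the vars package also offers its own causality() function, which performs a Wald-type Granger test directly on a fitted VAR object. A sketch below, using simulated noise as a stand-in for the climate data (swap in the var3 model fitted above for the real thing):

```r
# Cross-check sketch: vars::causality() runs a Wald-type Granger test
# on a fitted VAR object. Simulated noise stands in for the climate data.
require(vars)
set.seed(2)
toy <- ts(cbind(d.co2 = rnorm(110), d.temp = rnorm(110)))
toy.var3 <- VAR(toy, p = 3)
causality(toy.var3, cause = "d.co2")$Granger  # H0: d.co2 does not Granger-cause d.temp
```

On the real data, causality(var3, cause = "d.co2") should agree with the grangertest output above.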

On a side note, we can use the vars package for predictions and the forecast package for time series plots of the predicted values.

> predict(var3, n.ahead=6, ci=0.95)

$d.co2

fcst lower upper CI

[1,] 202.5888 -2.717626 407.8953 205.3065

[2,] 110.3385 -105.847948 326.5249 216.1864

[3,] 192.1802 -26.160397 410.5207 218.3406

[4,] 152.5464 -74.948000 380.0408 227.4944

[5,] 108.4343 -122.198058 339.0666 230.6323

[6,] 123.9001 -107.882219 355.6824 231.7823

$d.temp

fcst lower upper CI

[1,] 0.026737000 -0.1719770 0.2254510 0.1987140

[2,] -0.057081637 -0.2731569 0.1589936 0.2160753

[3,] 0.040419451 -0.1803409 0.2611798 0.2207603

[4,] 0.032591047 -0.1893108 0.2544929 0.2219019

[5,] 0.013708836 -0.2143756 0.2417933 0.2280844

[6,] -0.004319714 -0.2324070 0.2237675 0.2280873

> fcst = forecast(var3)

> plot(fcst)

So what can we conclude from this exercise? Well, let's look to the good Doctor, Hunter S. Thompson, for some philosophical insight. He would likely advise us...

"res ipsa loquitur"

References:

Bilancia, Massimo, and Domenico Vitale, "Granger causality analysis of bivariate climatic time series: A note on the role of CO2 emissions in global climate warming."

Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones (2012), Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset, J. Geophys. Res., 117, D08101, doi:10.1029/2011JD017187.

Triacca, U., "Is Granger causality analysis appropriate to investigate the relationship between atmospheric concentration of carbon dioxide and global surface air temperature?", Theoretical and Applied Climatology, 81, 133-135.
