
Saturday, November 15, 2014

Causality in Time Series - A look at 2 R packages (CausalImpact and Changepoint)

My first blog post on Perceivant.com

http://perceivant.com/causality-time-series-determining-impact-marketing-interventions/


Tuesday, August 5, 2014

Results of the Readers' Survey

First of all, let me say “Thank You” to all of the 357 people who completed the survey. I was hoping for 100, so needless to say the response blew away my expectations. This seems like a worthwhile effort to repeat once a year. Next year I will refine the questionnaire based on what I've learned. Perhaps it can evolve into more of a report on the state of the R community and less of an exercise in deciding what the focus of my blog should be.

The pleasant diversions of summer and the exigencies of work have precluded any effort on my part to work on the blog, but indeed there is no rest for the wicked and I feel compelled to get this posted.

As a review, the questionnaire consisted of just 13 questions:
  • Age
  • Gender
  • Education Level
  • Employment Status
  • Country
  • Years of Experience Using R
  • Ranking of Interest to See Blog Posts on Various Analysis Techniques (7-point Likert Scale):
      ◦ Feature Selection
      ◦ Text Mining
      ◦ Data Visualization
      ◦ Time Series
      ◦ Generalized Additive Models
      ◦ Interactive Maps
  • Favorite Topic (select from a list or write in your preference)

Keep in mind that the questions were designed to determine what I should focus on in future posts, not to be an all-inclusive or exhaustive list of topics and techniques. My intent with this post is not to give a full review of the data, but to hit some highlights.

The bottom line is that the clear winner for what readers want to see in the blog is Data Visualization. The other thing that both surprised and pleased me is how global the blog's readership is. The survey respondents came from 50 different countries! So, what I want to present here are some simple graphs, and yes, visualizations, of the data I collected.

I will start out with the lattice package, move on to ggplot2, and conclude with some static maps. I'm still struggling with my preference between lattice and ggplot2. Maybe a preference is not necessary and we can all learn to coexist? I'm leaning towards lattice for simple summaries, as I find the code easier overall. However, ggplot2 seems to be a little more flexible and powerful.
> library(lattice)
> attach(survey)
> table(gender) #why can't base R have better tables?
gender
Female   Male 
    28    329
> barchart(education, main="Education", xlab="Count") #barchart of education level

This basic barchart in lattice needs a couple of improvements, which are simple enough to make. I want to change the color of the bars to black and sort the bars from high to low. The table and sort functions accomplish both.
> barchart(sort(table(education)), main="Education", xlab="Count", col="black")

There are a number of ways to examine a Likert scale. It is quite common in market research to treat a Likert scale as continuous data instead of ordinal. Another technique is to report the “top-two box”, coding the data as a binary variable (a quick sketch of that idea follows below). Here, I just look at the frequency bars of each of the questions in one plot, using ggplot2 and a multiplot function I copied from Winston Chang's excellent Cookbook for R, http://www.cookbook-r.com/ .
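As an aside, and not part of the original analysis, the top-two box coding mentioned above might look something like the following, assuming the Likert items are stored as numeric 1 to 7 scores in the columns used below:
> top2 = function(x) as.integer(x >= 6) #responses of 6 or 7 count as the top-two box
> sapply(survey[, c("feature.select", "text.mine", "data.vis", "tseries", "gam", "imaps")],
+   function(x) mean(top2(x), na.rm = TRUE)) #share of respondents in the top-two box per technique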
You can easily build a multiplot, first creating the individual plots:
> library(ggplot2)

> p1 = ggplot(survey, aes(x=feature.select)) + geom_bar(binwidth=0.5) + ggtitle("Feature Selection") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p2 = ggplot(survey, aes(x=text.mine)) + geom_bar(binwidth=0.5) + ggtitle("Text Mining") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p3 = ggplot(survey, aes(x=data.vis)) + geom_bar(binwidth=0.5, fill="green") + ggtitle("Data Visualization") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p4 = ggplot(survey, aes(x=tseries)) + geom_bar(binwidth=0.5) + ggtitle("Time Series") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p5 = ggplot(survey, aes(x=gam)) + geom_bar(binwidth=0.5) + ggtitle("Generalized Additive Models") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p6 = ggplot(survey, aes(x=imaps)) + geom_bar(binwidth=0.5) + ggtitle("Interactive Maps") + theme(axis.title.x = element_blank()) + ylim(0,225)
Now, here is the function to create a multiplot:
> multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
+ require(grid)
+ # Make a list from the ... arguments and plotlist
+ plots <- c(list(...), plotlist)
+ numPlots = length(plots)
+ # If layout is NULL, then use 'cols' to determine layout
+ if (is.null(layout)) {
+ # Make the panel
+ # ncol: Number of columns of plots
+ # nrow: Number of rows needed, calculated from # of cols
+ layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
+ ncol = cols, nrow = ceiling(numPlots/cols))
+ }
+ if (numPlots==1) {
+ print(plots[[1]])
+ } else {
+ # Set up the page
+ grid.newpage()
+ pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
+ # Make each plot, in the correct location
+ for (i in 1:numPlots) {
+ # Get the i,j matrix positions of the regions that contain this subplot
+ matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
+ print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
+ layout.pos.col = matchidx$col))
+ }
+ }
+ }
And now, we can compare the six variables in one plot with a single line of code, with Data Visualization the clear winner.
> multiplot(p1, p2, p3, p4, p5, p6, cols=3)

The last thing I will cover in this post is how to map the data on a global scale, comparing two packages: rworldmap and googleVis.
> library(rworldmap)
> #prepare the country data 
> df = data.frame(table(country))
> #join data to map
> join = joinCountryData2Map(df, joinCode="NAME",  nameJoinColumn="country", suggestForFailedCodes=TRUE) #one country observation was missing
> par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i") #enables a proper plot
The code below takes the log of the observations; without the log transformation, the map produced was of no value. It also adjusts the size of the legend, as the default seems too large in comparison with the map.
> map = mapCountryData( join, nameColumnToPlot="Freq", catMethod="logFixedWidth", addLegend=FALSE)
> do.call( addMapLegend, c(map, legendWidth=0.5, legendMar = 2))

Now, let's try googleVis. This produces the plot in HTML and it is also interactive; that is, you can hover your mouse over a country and get the actual counts.
> library(googleVis)
> G1 <- gvisGeoMap(df, locationvar='country', numvar='Freq', 
options=list(dataMode="regions"))
> plot(G1)
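As an aside that is not covered in the original post, you can also keep a copy of the interactive chart; to the best of my recollection, the print method for gvis objects accepts a file argument:
> print(G1, file = "reader_map.html") #writes the chart's HTML out to a file (the file name is just an example)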

That's it for now. I absolutely love working with data visualization on maps and will continue to incorporate them in future analyses. I can't wait to brutally overindulge and sink my teeth into the ggmap package. Until next time, Mahalo.

Saturday, June 28, 2014

R Blog Readers Survey

I've almost achieved my goal of a minimum of 100 quality responses.  I will keep the survey open for a couple more days, so if you haven't responded, please feel free to have a crack at it.

Regards,

Cory

https://www.surveymonkey.com/s/7DY8DYH

Sunday, June 22, 2014

Survey - Voice of the Reader

Dear Blog Readers,

I have numerous topics I could discuss in future blog posts. However, I would like to capture those topics that most interest you. Therefore, I've put together a short survey (see the link below) for everyone to weigh in on the subject.

I will keep this open for a couple of weeks and once closed, I will publish the results and make the data public.

Thank you in advance,

Cory

https://www.surveymonkey.com/s/7DY8DYH

Wednesday, May 14, 2014

The NFL Reloads: Using R to Have a Look

NFL Draft 2014

It can't be easy being a Jacksonville Jaguars fan. It seems that management has learned nothing from the Blaine Gabbert debacle. By that I mean I'm not impressed with their first-round pick, Blake Bortles. Bucky Brooks of NFL.com called him a “developmental prospect”. Developmental? Is that what you want from the third overall pick? I hate to say it, but I have to agree with Skip Bayless that Bortles is the new Gabbert in J-Ville. We shall see soon enough if I will eat these words!
At any rate, I've downloaded draft data from the web and you can explore the draft at your leisure. I scraped it off of Wikipedia at this link: http://en.wikipedia.org/wiki/2014_NFL_Draft
I put together some code to get you on your way to exploring the data.
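One note before the exploration code: the post doesn't show the step that reads the Wikipedia table into the nfldraft data frame. A minimal sketch of how that might be done, assuming the XML package and a bit of manual cleanup:

library(XML)
url = "http://en.wikipedia.org/wiki/2014_NFL_Draft"
tables = readHTMLTable(url)  #pulls every HTML table on the page into a list
#inspect names(tables) to locate the player-selections table, then clean it up and
#rename its columns to round, pick, team, player, position, college, conference, notes
#before saving it as the data frame nfldraft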

attach(nfldraft)
str(nfldraft)
## 'data.frame':    256 obs. of  8 variables:
##  $ round     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pick      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ team      : Factor w/ 32 levels "Arizona Cardinals",..: 13 29 15 4 23 2 30 8 18 11 ...
##  $ player    : Factor w/ 256 levels "Aaron Colvin ",..: 96 89 25 204 147 97 178 131 12 84 ...
##  $ position  : Factor w/ 18 levels "C","CB","DE",..: 3 11 13 18 9 11 18 2 9 17 ...
##  $ college   : Factor w/ 113 levels "Alabama","Arizona",..: 86 6 99 15 12 95 95 71 100 62 ...
##  $ conference: Factor w/ 24 levels "ACC","Big 12",..: 21 21 24 1 11 21 21 2 17 1 ...
##  $ notes     : Factor w/ 88 levels "","from Arizona[R1 - 5]",..: 1 86 1 20 1 1 1 51 9 1 ...
names(nfldraft)
## [1] "round"      "pick"       "team"       "player"     "position"  
## [6] "college"    "conference" "notes"
mytable = table(position, round)  #create a table of player position and round selected
mytable
##         round
## position 1 2 3 4 5 6 7
##      C   0 1 2 2 2 2 1
##      CB  5 1 2 9 4 7 5
##      DE  2 3 3 2 5 3 4
##      DT  2 3 4 3 3 1 4
##      FB  0 0 0 0 0 1 1
##      FS  0 0 0 0 0 0 2
##      G   0 1 6 1 4 2 0
##      K   0 0 0 0 0 0 2
##      LB  5 3 3 5 6 2 7
##      OLB 0 0 0 0 2 1 0
##      OT  5 4 3 1 1 3 4
##      P   0 0 0 0 0 1 0
##      QB  3 2 0 2 2 5 0
##      RB  0 3 5 5 0 4 2
##      S   4 1 2 4 3 1 0
##      SS  0 0 0 0 0 1 2
##      TE  1 3 3 0 1 0 2
##      WR  5 7 3 6 3 5 5
margin.table(mytable)  #generic table with sum...not very useful
## [1] 256
margin.table(mytable, 1)  #sum of positions selected
## position
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR 
##  10  33  22  20   2   2  14   2  31   3  21   1  14  19  15   3  10  34

margin.table(mytable, 2)  #sum of players selected by round
## round
##  1  2  3  4  5  6  7 
## 32 32 36 40 36 39 41
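As a small aside that is not in the original post, prop.table() turns these counts into proportions, which can be easier to compare:

round(prop.table(margin.table(mytable, 1)), 3)  #each position's share of all 256 picks
round(prop.table(mytable, margin = 2), 2)  #within each round, the mix of positions selected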

mosaicplot(mytable, main = "Position by Round", xlab = "Position",
    ylab = "Round", cex.axis = 1)  #use mosaicplot to graphically represent the data in the table

What if you want to drill down on a specific team or the like? Here is some simple code to examine a single team, in this case the Colts:

colts = nfldraft[which(nfldraft$team == "Indianapolis Colts"), ]  #select the rows of data unique to the Colts

colts.table = table(colts$round, colts$position)

margin.table(colts.table, 2)
## 
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR 
##   0   0   1   0   0   0   0   0   1   0   2   0   0   0   0   0   0   1

barplot(colts.table)

There you have it. With some quick modifications to the above, you could produce tables examining players selected by conference, college, and so on.
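For instance, here is a quick sketch along those lines (my addition, not from the original post):

conf.table = table(nfldraft$conference, nfldraft$round)
sort(margin.table(conf.table, 1), decreasing = TRUE)  #total picks by conference
head(sort(table(nfldraft$college), decreasing = TRUE), 10)  #the ten most-drafted colleges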

Thursday, April 24, 2014

Analysis of Copper and Equity Prices on the Quandl.com blog

Mythbusting - Dr. Copper

“An economist is an expert who will know tomorrow why the things he predicted yesterday didn't happen today.” Laurence J. Peter (author and creator of the Peter Principle)
If you were paying attention to financial sites last month, you probably noticed a number of articles on “Dr. Copper”. Here is just a small sample of some of the headlines:
So what is all the fuss about this so-called Doctor Copper? Well, it turns out that at some point in time Copper price action was believed to be a leading indicator for the overall equities market. Thus, it was said that Copper had a PhD in Economics (one can forgive the puns linking Copper to MD).
This raises a number of questions. Was this actually the case? If so, when was it a valid indicator? When did the signal fall apart? As always, R gives us the ability to see if this mythology is confirmed, plausible or busted.
The process to explore the question will be as follows:
  • Examination of Copper's price action over time
  • Econometric analysis of Copper and US equity index prices, subsetting historical periods
  • Granger causality tests
Using quandl.com, we can download the time series data we need to get started.
require(Quandl)
## Loading required package: Quandl
## Warning: package 'Quandl' was built under R version 3.0.3
# download monthly copper price per metric ton; we will subset from 1980 forward
copper = Quandl("WORLDBANK/WLD_COPPER", type = "ts")
plot(copper)
cop = window(copper, start = c(1980, 1), end = c(2014, 3))
plot(cop)
We can see that something dramatic happened around 2004/5 to drive prices significantly higher, prior to a precipitous collapse and eventual recovery. I will cut this up into a number of subsets, but for now I will play around with changepoints.
require(changepoint)
## Loading required package: changepoint
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Successfully loaded changepoint package version 1.1
## Created on 2013-06-19
##  Substantial changes to the structure of the package have occured between version 0.8 and 1.0.2.  Please see the package NEWS for details.
mean.cop = cpt.mean(cop, method = "BinSeg")
## Warning: The number of changepoints identified is Q, it is advised to
## increase Q to make sure changepoints have not been missed.
cpts(mean.cop)
## [1] 289 312 345 355 367
plot(mean.cop)
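As a quick aside that is not in the original post, the changepoint indices can be mapped back to calendar dates on the monthly series:

time(cop)[cpts(mean.cop)]  #e.g. observation 289 of a series starting January 1980 is January 2004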
Here are a couple of points I think are noteworthy. The first changepoint, at observation 289, equates to January 2004. The collapse in price, exemplified by observation 345, is September of '08. Perhaps one reason for the elevated copper prices over the last decade or so is China's economic growth and that country's use of copper as collateral. Take this quote for example: “A Reuters report this week that Chinese companies have used something between 60 and 80 percent of the country's copper imports as collateral to finance projects has unsettled markets, triggered a fresh wave of selling. 'This is an amazing estimate; it would change the perception of Dr. Copper as a gauge of the Chinese economy, as it's not being used for industrial production, but rather as a financing tool for whatever reason,' said Chris Weston, chief market strategist at IG.” (http://www.cnbc.com/id/101486303)
Financing tool!? If that is the case, then Dr. Copper is currently no better than a strip mall psychic at market prognostication. But what about the past? In a prior financial epoch, was the 'Red Metal' the 'Jimmy the Greek' of US equities? A side-by-side comparison with US stocks is in order. Again, we can quickly get our data from quandl.com…
sandp500 = Quandl("YAHOO/INDEX_SPY", type = "ts", collapse = "monthly", column = 6)
SandP = log(window(sandp500, start = c(1995, 1), end = c(2014, 3)))
plot(SandP)
# match copper time series to S&P 500
cop2 = log(window(cop, start = c(1995, 1), end = c(2014, 3)))
Now, let's compare the charts together.
compare.ts = cbind(SandP, cop2)
plot(compare.ts)
Taking a look at this from '95 to '05, can anyone see anything here that would lead us to conclude that copper is a leading indicator? Maybe from 2003 until 2011 there is something. So, let's cut this down into that subset.
sp.sub = window(SandP, start = c(2003, 1), end = c(2011, 12))
cop.sub = window(cop2, start = c(2003, 1), end = c(2011, 12))
compare2.ts = cbind(sp.sub, cop.sub)
plot(compare2.ts)
Perhaps we are on to something here, a period of time where copper truly was a leading indicator.
In previous posts, I looked at vector autoregression and Granger causality with differenced data to achieve stationarity. However, here I will apply VAR in “levels”, with the techniques spelled out in these two blog posts:
We first determine the maximum amount of integration.
require(forecast)
## Loading required package: forecast
## This is forecast 5.0
ndiffs(sp.sub, alpha = 0.05, test = c("kpss"))
## [1] 1
ndiffs(cop.sub, alpha = 0.05, test = c("kpss"))
## [1] 1
Second, run a VAR model selection to determine the correct number of lags. This is done on the levels, not the differences.
require(vars)
## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
VARselect(compare2.ts, lag = 24, type = "both")
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## 
## $criteria
##                 1          2          3          4          5          6
## AIC(n) -1.096e+01 -1.126e+01 -1.122e+01 -1.123e+01 -1.119e+01 -1.111e+01
## HQ(n)  -1.086e+01 -1.112e+01 -1.104e+01 -1.100e+01 -1.091e+01 -1.079e+01
## SC(n)  -1.073e+01 -1.092e+01 -1.076e+01 -1.066e+01 -1.049e+01 -1.030e+01
## FPE(n)  1.743e-05  1.285e-05  1.338e-05  1.324e-05  1.390e-05  1.502e-05
##                 7          8          9         10         11         12
## AIC(n) -1.106e+01 -1.101e+01 -1.100e+01 -1.096e+01 -1.089e+01 -1.087e+01
## HQ(n)  -1.069e+01 -1.060e+01 -1.053e+01 -1.045e+01 -1.033e+01 -1.027e+01
## SC(n)  -1.014e+01 -9.972e+00 -9.842e+00 -9.692e+00 -9.500e+00 -9.369e+00
## FPE(n)  1.582e-05  1.669e-05  1.703e-05  1.774e-05  1.928e-05  1.977e-05
##                13         14         15         16         17         18
## AIC(n) -1.084e+01 -1.080e+01 -1.071e+01 -1.075e+01 -1.071e+01 -1.071e+01
## HQ(n)  -1.019e+01 -1.010e+01 -9.968e+00 -9.956e+00 -9.871e+00 -9.821e+00
## SC(n)  -9.218e+00 -9.061e+00 -8.861e+00 -8.779e+00 -8.625e+00 -8.506e+00
## FPE(n)  2.070e-05  2.184e-05  2.414e-05  2.373e-05  2.516e-05  2.582e-05
##                19         20         21         22         23         24
## AIC(n) -1.068e+01 -1.073e+01 -1.071e+01 -1.070e+01 -1.063e+01 -1.058e+01
## HQ(n)  -9.746e+00 -9.756e+00 -9.685e+00 -9.631e+00 -9.516e+00 -9.418e+00
## SC(n)  -8.362e+00 -8.303e+00 -8.162e+00 -8.039e+00 -7.854e+00 -7.687e+00
## FPE(n)  2.728e-05  2.656e-05  2.817e-05  2.947e-05  3.298e-05  3.647e-05
# lag 2 is optimal for the VAR
This is where it gets tricky. To get the linear models I will need to create the lagged variables. The maximum number of lags is equal to the VAR lag selection (2) plus the maximum order of integration (1), so the magic number here is 3.
sp = zoo(sp.sub)
lag.sp = lag(sp, -(0:3), na.pad = T)
copper = zoo(cop.sub)
lag.copper = lag(copper, -(0:3), na.pad = T)
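An optional sanity check, which is my addition rather than the author's, that the lagged matrix lines up as intended:

head(cbind(sp, lag.sp), 5)  #the leading NAs come from na.pad = TRUE as each column is shifted back 0 to 3 months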
Here, the two linear models are built. Note that the index is used to account for the trend. You also need to order the variables the same way in each lm so that the Wald tests make sense.
lm1 = lm(sp ~ lag.sp[, 2:4] + lag.copper[, 2:4] + index(sp))
lm2 = lm(copper ~ lag.sp[, 2:4] + lag.copper[, 2:4] + index(sp))
With the linear models built, it is time to construct the Wald tests and examine the chi-squared p-values. Again, this is tricky, but follow along… We are testing the significance of the lagged coefficients of copper on the S&P and vice versa. However, the tests only include lags up to the optimal VAR lag (in this case 2), even though the models were built with 3 lags; the extra lag corrects for the order of integration and is left out of the test.
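To see exactly which coefficients the Terms argument points to (this check is mine, not the author's): in lm1 the intercept is term 1, the three S&P lags are terms 2 through 4, the three copper lags are terms 5 through 7, and the trend index is last.

names(coef(lm1))  #so Terms = 5:6 tests copper lags 1-2 in lm1, and Terms = 2:3 the S&P lags 1-2 in lm2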
require(aod)
## Loading required package: aod
## Warning: package 'aod' was built under R version 3.0.3
wald.test(b = coef(lm1), Sigma = vcov(lm1), Terms = c(5:6), df = 2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 9.7, df = 2, P(> X2) = 0.0079
## 
## F test:
## W = 4.8, df1 = 2, df2 = 2, P(> W) = 0.17
wald.test(b = coef(lm2), Sigma = vcov(lm2), Terms = c(2:3), df = 2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 13.2, df = 2, P(> X2) = 0.0014
## 
## F test:
## W = 6.6, df1 = 2, df2 = 2, P(> W) = 0.13
Given the significant p-values for both tests, what can we conclude? With Granger causality, when the tests are significant 'both ways', it is believed you are missing some important variable and causality is rejected. If we accept that as the case here, then the prognostic value of copper, even during this limited time frame, is 'Busted'. However, before we throw the proverbial baby out with the bath water, it is important to discuss a limitation of this work. The analysis was done with monthly data and, as a result, may not be sensitive enough to capture a signal of causality. All things considered, I have to say that the Dr. Copper myth is not totally Busted for the timeframe we examined. It is certainly not Confirmed, so at best it is Plausible.
CL