Monday, January 12, 2015

NFL Cluster Analysis...Part 2

Another post on Cluster Analysis and the NFL, this time on the Perceivant.com blog.

http://perceivant.com/nfl-ranking-cluster-analysis/


Monday, December 29, 2014

Cluster Analysis of the NFL’s Top Wide Receivers

“The time has come to get deeply into football. It is the only thing we have left that ain't fixed.”
Hunter S. Thompson, Hey Rube Column, November 9, 2004

I have to confess that I haven’t been following the NFL this year as much as planned or hoped.  On only 3 or 4 occasions this year have I been able to achieve a catatonic state while watching NFL RedZone.  Nonetheless, it is easy to envision how it is all going to end.  Manning will throw four picks on a cold, snowy day in Foxboro in the AFC Championship game, and the Seahawk defense will curb-stomp Aaron Rodgers and capture a second consecutive NFC crown.  As for the Super Bowl, well, who cares, other than the fact that we must cheer against the evil Patriot empire, rooting for their humiliating demise.  One can simultaneously hate and admire that team.  I prefer to do the former publicly and the latter in private.

We all seem to have a handle on the good and bad quarterbacks out there, but what about their wide receivers?  With the playoffs at the doorstep and my own ignorance of the situation, I had to bring myself up to speed.  This is a great opportunity to do a cluster analysis of the top wide receivers and see who is worth keeping an eye on in the upcoming spectacle.

A good source for interesting statistics and articles on NFL players and teams is http://www.advancedfootballanalytics.com/ .  Here we can download the rankings for the top 40 wide receivers based on Win Probability Added, or “WPA”.  To understand the calculation, you should head over to the site and read the details.  The site provides a rationale for WPA: “WPA has a number of applications. For starters, we can tell which plays were truly critical in each game. From a fan’s perspective, we can call a play the ‘play of the week’ or the ‘play of the year.’ And although we still can't separate an individual player's performance from that of his teammates', we add up the total WPA for plays in which individual players took part. This can help us see who really made the difference when it matters most. It can help tell us who is, or at least appears to be, “clutch.” It can also help inform us who really deserves the player of the week award, the selection to the Pro Bowl, or even induction into the Hall of Fame.”

I put the website’s wide receiver data table into a .csv, so we can start the analysis by reading in the file and examining its structure.

> receivers <- read.csv(file.choose())
> str(receivers)
'data.frame': 40 obs. of  19 variables:
 $ Rank     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Player   : Factor w/ 40 levels "10-E.Sanders",..: 16 33 1 23 37 24 36 2 4 13 ...
 $ Team     : Factor w/ 28 levels "ARZ","ATL","BLT",..: 11 23 10 22 9 12 12 28 17 19 ...
 $ G        : int  16 16 16 16 16 16 16 14 14 12 ...
 $ WPA      : num  2.43 2.4 2.33 2.33 2.3 2.27 2.19 1.91 1.89 1.76 ...
 $ EPA      : num  59 95.7 81.3 56.8 78.8 86.2 97.3 54.3 63.6 64.6 ...
 $ WPA_G    : num  0.15 0.15 0.15 0.15 0.14 0.14 0.14 0.14 0.14 0.15 ...
 $ EPA_P    : num  0.38 0.48 0.5 0.38 0.54 0.6 0.58 0.54 0.41 0.43 ...
 $ SR_PERC  : num  55.4 62.8 61.3 52 58.5 59.4 60.5 54.5 58.1 55 ...
 $ YPR      : num  13.4 13.2 13.9 15.5 15 14.1 15.5 20.2 10.6 14.3 ...
 $ Rec      : int  99 129 101 86 88 91 98 52 92 91 ...
 $ Yds      : int  1331 1698 1404 1329 1320 1287 1519 1049 972 1305 ...
 $ RecTD    : int  4 13 9 10 16 12 13 5 4 12 ...
 $ Tgts     : int  143 181 141 144 136 127 151 88 134 130 ...
 $ PER_Tgt  : num  24.2 30.2 23.4 23.5 28.9 23.9 28.4 16.2 22.3 21.7 ...
 $ YPT      : num  9.3 9.4 10 9.2 9.7 10.1 10.1 11.9 7.3 10 ...
 $ C_PERC   : num  69.2 71.3 71.6 59.7 64.7 71.7 64.9 59.1 68.7 70 ...
 $ PERC_DEEP: num  19.6 26.5 36.2 30.6 30.9 23.6 31.8 33 16.4 28.5 ...
 $ playoffs : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 2 1 2 1 ...

> head(receivers$Player)
[1] 15-G.Tate    84-A.Brown   10-E.Sanders 18-J.Maclin  88-D.Bryant
[6] 18-R.Cobb

Based on WPA, Golden Tate of Detroit is the highest-ranked wide receiver; in contrast, his highly regarded teammate is ranked 13th.
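If you want a closer look at any single receiver (a quick aside, not from the original post), you can subset on the Player factor:

> receivers[receivers$Player == "15-G.Tate", ]   #pull Golden Tate's full stat line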

We talked about WPA, so here is a quick synopsis of the other variables; again, please go to the website for detailed explanations:

  • EPA - Expected Points Added
  • WPA_G - WPA per game
  • EPA_P - Expected Points Added per Play
  • SR_PERC - Success rate: the percentage of plays the receiver was involved in that are considered successful
  • YPR - Yards Per Reception
  • Rec - Total Receptions
  • Yds - Total Reception Yards
  • RecTD - Receiving Touchdowns
  • Tgts - The number of times a receiver was targeted in the passing game
  • PER_Tgt - Percentage of the team’s passes that targeted the receiver
  • YPT - Yards per target
  • C_PERC - Completion percentage
  • PERC_DEEP - Percent of passes targeted deep
  • playoffs - A factor I coded for whether or not the receiver’s team made the playoffs

To do hierarchical clustering with this data, we can use the hclust() function available in base R.  In preparation, we scale the data and create a distance matrix.

> r.df <- receivers[,c(4:18)]       #keep only the numeric performance variables
> rownames(r.df) <- receivers[,2]   #label the rows with the player names
> scaled <- scale(r.df)             #center and scale each variable
> d <- dist(scaled)                 #Euclidean distance matrix

With the data prepared, produce the cluster object and plot it.

> hc <- hclust(d, method="ward.D")
> plot(hc, hang=-1, xlab="", sub="")


This is the standard dendrogram produced with hclust.  We now need to select the proper number of clusters and produce a dendrogram that is easier to examine.  For this, I found some interesting code to adapt on Gaston Sanchez’s blog: http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html .  Since I am leaning towards 5 clusters, let’s first create a vector of colors.  (Note: you can find/search for color codes on colorhexa.com)

> labelColors = c("#FF0000",  "#800080","#0000ff", "#ff8c00","#013220")

Then, use the cutree() function to specify 5 clusters:

> clusMember = cutree(hc, 5)
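Five clusters is a judgment call from eyeballing the dendrogram.  As a rough sanity check (my addition, not part of the original post), you could compute the total within-cluster sum of squares on the scaled data for several values of k and look for an elbow:

> wss <- sapply(2:10, function(k) {
+   groups <- cutree(hc, k)
+   sum(sapply(unique(groups), function(g) {
+     sub <- scaled[groups == g, , drop = FALSE]
+     sum(scale(sub, scale = FALSE)^2)   #squared distances to the cluster centroid
+   }))
+ })
> plot(2:10, wss, type = "b", xlab = "Number of clusters", ylab = "Within-cluster SS")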

Now, we create a function (courtesy of Gaston) to apply colors to the clusters in the dendrogram.

> colLab <- function(n) {
+   if (is.leaf(n)) {
+     a <- attributes(n)
+     labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]
+     attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
+   }
+   n
+ }

Finally, we turn “hc” into a dendrogram object and plot the new results.

> hcd <- as.dendrogram(hc)
> clusDendro = dendrapply(hcd, colLab)
> plot(clusDendro, main = "NFL Receiver Clusters", type = "triangle")



That is much better.  For more in-depth analysis, you can put the clusters back into the original data frame.

> receivers$cluster <- as.factor(cutree(hc, 5))

It is now rather interesting to plot the variables by cluster to examine the differences.  In the interest of time and space, I present just a boxplot of WPA by cluster.

> boxplot(WPA~cluster, data=receivers, main="Receiver WPA by Cluster")
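Beyond the boxplot, a quick numeric comparison (again, my addition rather than part of the original post) is to average a few of the statistics within each cluster:

> aggregate(cbind(WPA, EPA, Rec, Yds, RecTD) ~ cluster, data = receivers, FUN = mean)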



Before moving on, I present this simple table of the receiver clusters by playoff qualification.  It is interesting to note that cluster 1, with the high WPA, also has 10 of its 13 receivers in the playoffs.  One thing that I think would be worth a look is adjusting wide receiver WPA by some weight based on the quality of the quarterback.  Note that Randall Cobb and Jordy Nelson of the Packers have high WPA, ranked 6th and 7th respectively, but have the privilege of having Rodgers as their QB.  Remember, per the quote above, WPA does not have the ability to separate an individual’s success from a teammate’s success.  This raises some interesting questions for me that require further inquiry.

> table(receivers$cluster, receivers$playoffs)
   
     n  y
  1  3 10
  2  4  2
  3  7  4
  4  4  0
  5  3  3
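If you prefer proportions to counts (a small extra, not in the original post), prop.table() with margin = 1 shows the playoff share within each cluster:

> prop.table(table(receivers$cluster, receivers$playoffs), margin = 1)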

In closing the final blog post of the year, I must make some predictions for the College Football Playoff.  I hate to say it, but I think Alabama will roll over Ohio State.  In the Rose Bowl, FSU comes from behind to win...again!  I’d really like to see the Ducks win it all, but I just don’t see their defense being of the quality to stop Winston when it counts, which will be in the fourth quarter.  ‘Bama has that defense, well, the defensive line and backers anyway.  Therefore, I have to give the Crimson Tide the nod in the championship.  The news is not all bad.  Nebraska finally let go of Bo Pelini.  I was ecstatic about his hire, but the paucity of top-notch recruits finally manifested itself in perpetual high-level mediocrity.  His best years were with Callahan’s recruits, Ndamukong Suh among many others.  They should have hired Paul Johnson from Georgia Tech; at least it would have been fun and somewhat nostalgic to watch Husker football again.

Mahalo,

Cory

Saturday, November 15, 2014

Causality in Time Series - A look at 2 R packages (CausalImpact and Changepoint)

My first blog post on Perceivant.com

http://perceivant.com/causality-time-series-determining-impact-marketing-interventions/


Tuesday, August 5, 2014

Results of the Readers' Survey

First of all, let me say “Thank You” to all 357 people who completed the survey. I was hoping for 100, so needless to say the response blew away my expectations. This seems like a worthwhile endeavor to repeat once a year. Next year I will refine the questionnaire based on what I've learned. Perhaps this can evolve into more of a report on the state of the R community and less of an exercise in determining what the focus of my blog should be.

The pleasant diversions of summer and the exigencies of work have precluded any effort on my part to work on the blog, but indeed there is no rest for the wicked and I feel compelled to get this posted.

As a review, the questionnaire consisted of just 13 questions:
  • Age
  • Gender
  • Education Level
  • Employment Status
  • Country
  • Years of Experience Using R
  • Ranking of Interest to See Blog Posts on Various Analysis Techniques (7-point Likert Scale):
      • Feature Selection
      • Text Mining
      • Data Visualization
      • Time Series
      • Generalized Additive Models
      • Interactive Maps
  • Favorite Topic (select from a list or write in your preference)

Keep in mind that the questions were designed to see what I should focus on in future posts, not to be an all-inclusive or exhaustive list of topics/techniques.  My intent with this post is not to give a full review of the data, but to hit some highlights.

The bottom line is that the clear winner for what readers want to see in the blog is Data Visualization. The other thing that both surprised and pleased me is how global the blog's readership is. The survey respondents were from 50 different countries! So, what I want to present here are some simple graphs and, yes, visualizations of the data I collected.

I will start out with the lattice package, move on to ggplot2, and conclude with some static maps. I'm still struggling with my preference between lattice and ggplot2. Maybe a preference is not necessary and we can all learn to coexist? I'm leaning towards lattice for simple summaries, and I think the code is easier overall. However, ggplot2 seems to be a little more flexible and powerful.
> library(lattice)
> attach(survey)
> table(gender) #why can't base R have better tables?
gender
Female   Male 
    28    329
> barchart(education, main="Education", xlab="Count") #barchart of education level

This basic barchart in lattice needs a couple of improvements, which is simple enough.  I want to change the color of the bars to black and to sort the bars from high to low.  You can use the table and sort functions to accomplish it.
> barchart(sort(table(education)), main="Education", xlab="Count", col="black")

There are a number of ways to examine a Likert scale.  It is quite common in market research to treat a Likert scale as continuous data instead of ordinal.  Another technique is to report the “top-two box”, coding the data as a binary variable.  Below, I just look at frequency bars for each of the questions in one plot, using ggplot2 and a function I copied from Winston Chang's excellent ebook, Cookbook for R, http://www.cookbook-r.com/ .
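As an aside (this snippet is not in the original post), the “top-two box” idea is easy to compute.  Assuming the six Likert items are coded 1 through 7 with 7 the highest interest, something like this reports the percentage of respondents answering 6 or 7:

> top2 <- function(x) round(mean(x >= 6, na.rm = TRUE) * 100, 1)   #percent in the top two boxes
> sapply(survey[, c("feature.select", "text.mine", "data.vis", "tseries", "gam", "imaps")], top2)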
You can easily build a multiplot, first creating the individual plots:
> library(ggplot2)

> p1 = ggplot(survey, aes(x=feature.select)) + geom_bar(binwidth=0.5) + ggtitle("Feature Selection") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p2 = ggplot(survey, aes(x=text.mine)) + geom_bar(binwidth=0.5) + ggtitle("Text Mining") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p3 = ggplot(survey, aes(x=data.vis)) + geom_bar(binwidth=0.5, fill="green") + ggtitle("Data Visualization") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p4 = ggplot(survey, aes(x=tseries)) + geom_bar(binwidth=0.5) + ggtitle("Time Series") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p5 = ggplot(survey, aes(x=gam)) + geom_bar(binwidth=0.5) + ggtitle("Generalized Additive Models") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p6 = ggplot(survey, aes(x=imaps)) + geom_bar(binwidth=0.5) + ggtitle("Interactive Maps") + theme(axis.title.x = element_blank()) + ylim(0,225)
Now, here is the function to create a multiplot:
> multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
+ require(grid)
+ # Make a list from the ... arguments and plotlist
+ plots <- c(list(...), plotlist)
+ numPlots = length(plots)
+ # If layout is NULL, then use 'cols' to determine layout
+ if (is.null(layout)) {
+ # Make the panel
+ # ncol: Number of columns of plots
+ # nrow: Number of rows needed, calculated from # of cols
+ layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
+ ncol = cols, nrow = ceiling(numPlots/cols))
+ }
+ if (numPlots==1) {
+ print(plots[[1]])
+ } else {
+ # Set up the page
+ grid.newpage()
+ pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
+ # Make each plot, in the correct location
+ for (i in 1:numPlots) {
+ # Get the i,j matrix positions of the regions that contain this subplot
+ matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
+ print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
+ layout.pos.col = matchidx$col))
+ }
+ }
+ }
And now we can compare the six variables in one plot with one line of code, with data visualization the clear winner.
> multiplot(p1, p2, p3, p4, p5, p6, cols=3)

The last thing I will cover in this post is how to do maps of the data on a global scale and compare two packages: rworldmap and googleVis.
> library(rworldmap)
> #prepare the country data 
> df = data.frame(table(country))
> #join data to map
> join = joinCountryData2Map(df, joinCode="NAME",  nameJoinColumn="country", suggestForFailedCodes=TRUE) #one country observation was missing
> par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i") #enables a proper plot
The code below bins the observations on a log scale (catMethod="logFixedWidth"); without the log values, the map produced was of no value.  It also shrinks the legend, as the default size seems too large in comparison with the map.
> map = mapCountryData( join, nameColumnToPlot="Freq", catMethod="logFixedWidth", addLegend=FALSE)
> do.call( addMapLegend, c(map, legendWidth=0.5, legendMar = 2))

Now, let's try googleVis.  This will produce the plot in html and it is also interactive.  That is, you can hover your mouse over a country and get the actual counts.
> library(googleVis)
> G1 <- gvisGeoMap(df, locationvar='country', numvar='Freq', 
options=list(dataMode="regions"))
> plot(G1)

That's it for now.  I absolutely love working with data visualization on maps and will continue to try to incorporate them in future analyses.  I can't wait to brutally overindulge and sink my teeth into the ggmap package.  Until next time, Mahalo.

Saturday, June 28, 2014

R Blog Readers Survey

I've almost achieved my goal of a minimum of 100 quality responses.  I will keep the survey open for a couple more days, so if you haven't responded, please feel free to have a crack at it.

Regards,

Cory

https://www.surveymonkey.com/s/7DY8DYH

Sunday, June 22, 2014

Survey - Voice of the Reader

Dear Blog Readers,

I have numerous topics I could discuss in future blog posts.  However, I would like to capture those topics that most interest you.  Therefore, I've put together a short survey (see the link below) for everyone to weigh in on the subject.

I will keep this open for a couple of weeks and once closed, I will publish the results and make the data public.

Thank you in advance,

Cory

https://www.surveymonkey.com/s/7DY8DYH

Wednesday, May 14, 2014

The NFL Reloads: Using R to Have a Look

NFL Draft 2014

It can't be easy being a Jacksonville Jaguars fan. It seems that management has learned nothing from the Blaine Gabbert debacle. By that I mean I'm not impressed with their first round pick Blake Bortles. Bucky Brooks of NFL.com called him a “developmental prospect”. Developmental? Is that what you want from the third overall pick? I hate to say it but I have to agree with Skip Bayless that Bortles is the new Gabbert in J-Ville. We shall see soon enough if I will eat these words!
At any rate, I've downloaded draft data from the web and you can explore the draft at your own leisure. I scraped it off of Wikipedia at this link: http://en.wikipedia.org/wiki/2014_NFL_Draft
I put together some code to get you on your way to exploring it.
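(The post doesn't show the step that loads the scraped table into R. Assuming it was saved as a local .csv, with the file name below just a placeholder, something like this creates the nfldraft data frame used in the rest of the code.)

nfldraft <- read.csv("nfl_draft_2014.csv", stringsAsFactors = TRUE)  #hypothetical file name for the scraped Wikipedia table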

attach(nfldraft)
str(nfldraft)
## 'data.frame':    256 obs. of  8 variables:
##  $ round     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pick      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ team      : Factor w/ 32 levels "Arizona Cardinals",..: 13 29 15 4 23 2 30 8 18 11 ...
##  $ player    : Factor w/ 256 levels "Aaron Colvin ",..: 96 89 25 204 147 97 178 131 12 84 ...
##  $ position  : Factor w/ 18 levels "C","CB","DE",..: 3 11 13 18 9 11 18 2 9 17 ...
##  $ college   : Factor w/ 113 levels "Alabama","Arizona",..: 86 6 99 15 12 95 95 71 100 62 ...
##  $ conference: Factor w/ 24 levels "ACC","Big 12",..: 21 21 24 1 11 21 21 2 17 1 ...
##  $ notes     : Factor w/ 88 levels "","from Arizona[R1 - 5]",..: 1 86 1 20 1 1 1 51 9 1 ...
names(nfldraft)
## [1] "round"      "pick"       "team"       "player"     "position"  
## [6] "college"    "conference" "notes"
mytable = table(position, round)  #create a table of player position and round selected
mytable
##         round
## position 1 2 3 4 5 6 7
##      C   0 1 2 2 2 2 1
##      CB  5 1 2 9 4 7 5
##      DE  2 3 3 2 5 3 4
##      DT  2 3 4 3 3 1 4
##      FB  0 0 0 0 0 1 1
##      FS  0 0 0 0 0 0 2
##      G   0 1 6 1 4 2 0
##      K   0 0 0 0 0 0 2
##      LB  5 3 3 5 6 2 7
##      OLB 0 0 0 0 2 1 0
##      OT  5 4 3 1 1 3 4
##      P   0 0 0 0 0 1 0
##      QB  3 2 0 2 2 5 0
##      RB  0 3 5 5 0 4 2
##      S   4 1 2 4 3 1 0
##      SS  0 0 0 0 0 1 2
##      TE  1 3 3 0 1 0 2
##      WR  5 7 3 6 3 5 5
margin.table(mytable)  #generic table with sum...not very useful
## [1] 256
margin.table(mytable, 1)  #sum of positions selected
## position
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR 
##  10  33  22  20   2   2  14   2  31   3  21   1  14  19  15   3  10  34

margin.table(mytable, 2)  #sum of players selected by round
## round
##  1  2  3  4  5  6  7 
## 32 32 36 40 36 39 41

mosaicplot(mytable, main = "Position by Round", xlab = "Position",
    ylab = "Round", cex.axis = 1)  #you can use mosaicplot to graphically represent the data in the table

What if you want to drill down on a specific team or the like? Here is some simple code to examine a specific team, in this case the Colts:

colts = nfldraft[which(nfldraft$team == "Indianapolis Colts"), ]  #select the rows of data unique to the Colts

colts.table = table(colts$round, colts$position)

margin.table(colts.table, 2)
## 
##   C  CB  DE  DT  FB  FS   G   K  LB OLB  OT   P  QB  RB   S  SS  TE  WR 
##   0   0   1   0   0   0   0   0   1   0   2   0   0   0   0   0   0   1

barplot(colts.table)

There you have it. With some quick modifications to the above, you could produce tables examining players selected by conference, college, etc.
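For example (a sketch of my own, not from the original post), a conference-by-round table and a quick count of the most-drafted colleges would look like this:

conf.table = table(conference, round)  #players selected by conference and round
conf.table
head(sort(table(college), decreasing = TRUE), 10)  #ten colleges with the most players drafted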