Saturday, November 7, 2015

Plotting Russian Airstrikes in Syria

"Who do we think will rise if Assad falls?"
"Do we have a “government in a box” that we think we can fly to Damascus and put into power if the Syrian army collapses, the regime falls and ISIS approaches the capital?"
"Have we forgotten the lesson of “Animal Farm”? When the animals revolt and take over the farm, the pigs wind up in charge." 
Patrick J. Buchanan

In my new book, "Mastering Machine Learning with R", I wanted to include geo-spatial mapping in the chapter on cluster analysis.  I actually completed the entire chapter doing a cluster analysis on the Iraq WikiLeaks data, plotting the clusters on a map and building a story around developing an intelligence estimate for the Al-Doura Oil Refinery, which I visited on many occasions during my 2009 "sabbatical".  However, the publisher convinced me that the material was too sensitive for such a book, and I completely rewrote the analysis with a different data set.  I may or may not publish it on this blog at some point, but I want to continue to explore building maps in R.  As luck would have it, I stumbled onto a data set showing the locations of Russian airstrikes in Syria at the following site:

The data includes the latitude and longitude of the strikes along with other background information. The what, how and why the data was collected is available here:

In short, the site tried to independently verify locations, targets, etc., and it also includes what it claims are the reported versus actual strike locations.  When I pulled the data, the site had analyzed 60 strikes.  It was unable to determine the locations of 11 of them, so we have 49 data points.

I built the data in Excel and saved it as a .csv, which I've already loaded.
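For reference, loading the file would look something like this (the file name here is a hypothetical placeholder, not from the original post):

> # a minimal sketch; the file name is assumed
> airstrikes = read.csv("airstrikes.csv", stringsAsFactors = FALSE)

Here is the structure of the data.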

> str(airstrikes)
'data.frame':  120 obs. of  4 variables:
 $ Airstrikes   : chr  "Strike 1" "Strike 10" "Strike 11" "Strike 12" ...
 $ Lat          : chr  "35.687782" "35.725846" "35.734952" "35.719518" ...
 $ Long         : chr  "36.786667" "36.260419" "36.073837" "36.072385" ...
 $ real_reported: chr  "real" "real" "real" "real" ...

> head(airstrikes)
  Airstrikes       Lat      Long real_reported
1   Strike 1 35.687782 36.786667          real
2  Strike 10 35.725846 36.260419          real
3  Strike 11 35.734952 36.073837          real
4  Strike 12 35.719518 36.072385          real
5  Strike 13 35.309074 36.620506          real
6  Strike 14 35.817206 36.124503          real

> tail(airstrikes)
    Airstrikes       Lat      Long real_reported
115  Strike 59 35.644864 36.338568      reported
116   Strike 6 35.740134 36.247029      reported
117  Strike 60  36.09346 37.085198      reported
118   Strike 7 35.702113 36.563525      reported
119   Strike 8 35.822472 36.018779      reported
120   Strike 9 35.725846 36.260419      reported

Since Lat and Long are character vectors, I need to convert them to numeric and also create a subset containing only the actual ("real") strike locations.

> airstrikes$Lat = as.numeric(airstrikes$Lat)
Warning message:
NAs introduced by coercion

> airstrikes$Long = as.numeric(airstrikes$Long)
Warning message:
NAs introduced by coercion

> real=subset(airstrikes, airstrikes$real_reported=="real")
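
As an aside (not in the original post), the coercion warnings are expected: they flag the rows whose coordinates could not be parsed as numbers, presumably the strikes whose locations the site could not determine.  A quick check:

> # count coordinates that failed to parse (assumption: these correspond
> # to strikes whose locations could not be determined)
> sum(is.na(airstrikes$Lat))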

I will be using ggmap for this effort, pulling in Google Maps for plotting.

> library(ggmap)
Loading required package: ggplot2
Google Maps API Terms of Service:
Please cite ggmap if you use it: see citation('ggmap') for details.

> citation('ggmap')

To cite ggmap in publications, please use:

  D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The
  R Journal, 5(1), 144-161. URL

The first map will be an overall view of the country with the map type as "terrain".  Note that "satellite", "hybrid" and "roadmap" are also available.

> map1 = ggmap(
   get_googlemap(center="Syria", zoom=7, maptype="terrain"))

With the map created as object "map1", I plot the locations using "geom_point()".

> map1 + geom_point(
   data = real, aes (x = Long, y = Lat), pch = 19,  size = 6, col="red3")

With the exception of what looks like one strike near Ar Raqqah, we can see they are concentrated between Aleppo and Homs with some close to the Turkish border.  Let's have a closer look at that region.

> map2 = ggmap(
   get_googlemap(center="Ehsim, Syria", zoom=9, maptype="terrain"))

> map2 + geom_point(data = real, aes (x = Long, y = Lat),
  pch = 18,  size = 9, col="red2")

East of Ghamam is a large concentration, so let's zoom in on that area and add the strike number as labels.

> map3 = ggmap(
   get_googlemap(center="Dorien, Syria",zoom=13, maptype="hybrid"))

> map3 + geom_point(
   data = real, aes (x = Long, y = Lat),pch = 18, size = 9, col="red3") +
   geom_text(data=real,aes(x=Long, y=Lat, label=Airstrikes),
   size = 5, vjust = 0, hjust = -0.25, color="white")

The last thing I want to do is focus in on the site for Strike 28.  To do this we will require the lat and long, which we can find with the which() function.

> which(real$Airstrikes =="Strike 28")
[1] 21

> real[21,]
   Airstrikes      Lat     Long real_reported
21  Strike 28 35.68449 36.11946          real

It is now just a simple matter of using those coordinates to call up the Google map.

> map4 = ggmap(
   get_googlemap(center=c(lon=36.11946,lat=35.68449), zoom=17, maptype="satellite"))

> map4 + geom_point(
   data = real, aes (x = Long, y = Lat), pch = 22, size = 12, col="red3") +
   geom_text(data=real,aes(x=Long, y=Lat, label=Airstrikes),
   size = 9, vjust = 0, hjust = -0.25, color="white")

From the looks of it, this seems to be an isolated location, so it was probably some sort of base or logistics center.  If you're interested, the Russian Ministry of Defense posts videos of these strikes and you can see this one on YouTube.

OK, so that is a quick tutorial on using ggmap, a very powerful package.  We've just scratched the surface of what it can do.  I will continue to monitor the site for additional data, and perhaps I will publish a Shiny app if the data becomes large and "rich" enough.
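To give a flavor of what such an app might look like, here is a minimal sketch of my own (not part of the original analysis).  It assumes the airstrikes data frame from above, that the shiny package is installed, and that get_googlemap() can be called as in the examples; it simply lets the user toggle between the real and reported locations.

library(shiny)
library(ggmap)

ui = fluidPage(
  selectInput("type", "Location type", choices = c("real", "reported")),
  plotOutput("strikeMap")
)

server = function(input, output) {
  output$strikeMap = renderPlot({
    # keep only the rows matching the selected location type
    sub = subset(airstrikes, real_reported == input$type)
    ggmap(get_googlemap(center = "Syria", zoom = 7, maptype = "terrain")) +
      geom_point(data = sub, aes(x = Long, y = Lat), pch = 19, size = 4, col = "red3")
  })
}

shinyApp(ui = ui, server = server)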



Friday, October 30, 2015

Machine Learning with R

My effort on machine learning with R is now available.

The book is available now, and a preview is also available online.

Monday, January 12, 2015

NFL Cluster Analysis...Part 2

Another post on Cluster Analysis and the NFL, this time on the blog.

Monday, December 29, 2014

Cluster Analysis of the NFL’s Top Wide Receivers

“The time has come to get deeply into football. It is the only thing we have left that ain't fixed.”
Hunter S. Thompson, Hey Rube Column, November 9, 2004

I have to confess that I haven’t been following the NFL this year as much as planned or hoped.  On only 3 or 4 occasions this year have I been able to achieve a catatonic state while watching NFL RedZone.  Nonetheless, it is easy to envision how it is all going to end.  Manning will throw four picks on a cold snowy day in Foxboro in the AFC Championship game and the Seahawk defense will curb-stomp Aaron Rodgers and capture a consecutive NFC crown.  As for the Super Bowl, well who cares other than the fact that we must cheer against the evil Patriot empire, rooting for their humiliating demise.  One can simultaneously hate and admire that team.  I prefer to do the former publicly and the latter in private.

We all seem to have a handle on the good and bad quarterbacks out there, but what about their wide receivers?  With the playoffs at the doorstep and my own ignorance of the situation, I had to bring myself up to speed.  This is a great opportunity to do a cluster analysis of the top wide receivers and see who is worth keeping an eye on in the upcoming spectacle.

From a site with interesting statistics and articles on NFL players and teams, we can download the rankings for the top 40 wide receivers based on Win Probability Added or “WPA”.  To understand the calculation you should head over to the site and read the details.  The site provides a rationale for WPA: “WPA has a number of applications. For starters, we can tell which plays were truly critical in each game. From a fan’s perspective, we can call a play the ‘play of the week’ or the ‘play of the year.’ And although we still can't separate an individual player's performance from that of his teammates', we add up the total WPA for plays in which individual players took part. This can help us see who really made the difference when it matters most. It can help tell us who is, or at least appears to be, “clutch.” It can also help inform us who really deserves the player of the week award, the selection to the Pro Bowl, or even induction into the Hall of Fame.”

I put the website’s wide receiver data table in a .csv and we can start the analysis, reading in the file and examining its structure.

> receivers <- read.csv(file.choose())
> str(receivers)
'data.frame': 40 obs. of  19 variables:
 $ Rank     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Player   : Factor w/ 40 levels "10-E.Sanders",..: 16 33 1 23 37 24 36 2 4 13 ...
 $ Team     : Factor w/ 28 levels "ARZ","ATL","BLT",..: 11 23 10 22 9 12 12 28 17 19 ...
 $ G        : int  16 16 16 16 16 16 16 14 14 12 ...
 $ WPA      : num  2.43 2.4 2.33 2.33 2.3 2.27 2.19 1.91 1.89 1.76 ...
 $ EPA      : num  59 95.7 81.3 56.8 78.8 86.2 97.3 54.3 63.6 64.6 ...
 $ WPA_G    : num  0.15 0.15 0.15 0.15 0.14 0.14 0.14 0.14 0.14 0.15 ...
 $ EPA_P    : num  0.38 0.48 0.5 0.38 0.54 0.6 0.58 0.54 0.41 0.43 ...
 $ SR_PERC  : num  55.4 62.8 61.3 52 58.5 59.4 60.5 54.5 58.1 55 ...
 $ YPR      : num  13.4 13.2 13.9 15.5 15 14.1 15.5 20.2 10.6 14.3 ...
 $ Rec      : int  99 129 101 86 88 91 98 52 92 91 ...
 $ Yds      : int  1331 1698 1404 1329 1320 1287 1519 1049 972 1305 ...
 $ RecTD    : int  4 13 9 10 16 12 13 5 4 12 ...
 $ Tgts     : int  143 181 141 144 136 127 151 88 134 130 ...
 $ PER_Tgt  : num  24.2 30.2 23.4 23.5 28.9 23.9 28.4 16.2 22.3 21.7 ...
 $ YPT      : num  9.3 9.4 10 9.2 9.7 10.1 10.1 11.9 7.3 10 ...
 $ C_PERC   : num  69.2 71.3 71.6 59.7 64.7 71.7 64.9 59.1 68.7 70 ...
 $ PERC_DEEP: num  19.6 26.5 36.2 30.6 30.9 23.6 31.8 33 16.4 28.5 ...
 $ playoffs : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 2 1 2 1 ...

> head(receivers$Player)
[1] 15-G.Tate    84-A.Brown   10-E.Sanders 18-J.Maclin  88-D.Bryant
[6] 18-R.Cobb

Based on WPA, Golden Tate of Detroit is the highest-ranked wide receiver; in contrast, his highly regarded teammate is ranked 13th.

We talked about WPA so here is a quick synopsis on the other variables; again, please go to the website for detailed explanations:

  • EPA - Expected Points Added
  • WPA_G - WPA per game
  • EPA_P - Expected Points Added per Play
  • SR_PERC - Success Rate, the percentage of plays involving the receiver that were considered successful
  • YPR - Yards Per Reception
  • Rec - Total Receptions
  • Yds - Total Reception Yards
  • RecTD - Receiving Touchdowns
  • Tgts - The number of times a receiver was targeted in the passing game
  • PER_Tgt - Percentage of a team’s passes that targeted the receiver
  • YPT - Yards per times targeted by a pass
  • C_PERC - Completion percentage
  • PERC_DEEP - Percent of passes targeted deep
  • playoffs - A factor I coded on whether the receiver’s team is in the playoffs or not

To do hierarchical clustering with this data, we can use the hclust() function available in base R.  In preparation for that, we should scale the data and we must create a distance matrix.

> r.df <- receivers[,c(4:18)]
> rownames(r.df) <- receivers[,2]
> scaled <- scale(r.df)
> d <- dist(scaled)

With the data prepared, produce the cluster object and plot it.

> hc <- hclust(d, method="ward.D")
> plot(hc, hang=-1, xlab="", sub="")

This is the standard dendrogram produced with hclust().  We now need to select the proper number of clusters and produce a dendrogram that is easier to examine.  For this, I found some interesting code to adapt on Gaston Sanchez’s blog.  Since I am leaning towards 5 clusters, let’s first create a vector of colors.  (Note: hex color codes are easy to look up online.)

> labelColors = c("#FF0000",  "#800080","#0000ff", "#ff8c00","#013220")

Then use the cutree() function to specify 5 clusters.

> clusMember = cutree(hc, 5)
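
As an aside (my own check, not from the original post), one quick way to sanity-check the choice of 5 clusters is the average silhouette width for a range of k, computed on the same distance matrix “d”; this assumes the cluster package is installed.

> library(cluster)
> avg.sil = sapply(2:8, function(k) {
+   # mean silhouette width for the k-cluster solution cut from "hc"
+   mean(silhouette(cutree(hc, k), d)[, "sil_width"])
+ })
> names(avg.sil) = 2:8
> avg.sil  # higher values suggest a more coherent clustering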

Now, we create a function (courtesy of Gaston) to apply colors to the clusters in the dendrogram.

> colLab <- function(n) {
+   if (is.leaf(n)) {
+     a <- attributes(n)
+     # look up this leaf's cluster assignment and assign the matching color
+     labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]
+     attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
+   }
+   n
+ }

Finally, we turn “hc” into a dendrogram object and plot the new results.

> hcd <- as.dendrogram(hc)
> clusDendro = dendrapply(hcd, colLab)
> plot(clusDendro, main = "NFL Receiver Clusters", type = "triangle")

That is much better.  For more in-depth analysis you can put the clusters back into the original dataframe.

> receivers$cluster <- as.factor(cutree(hc, 5))

It is now rather interesting to plot the variables by cluster to examine the differences.  In the interest of time and space, I present just a boxplot of WPA by cluster.

> boxplot(WPA~cluster, data=receivers, main="Receiver Rank by Cluster")

Before moving on, I present a simple table of the receiver clusters by playoff qualification.  It is interesting to note that cluster 1, with the high WPA, also has 10 of its 13 receivers in the playoffs.  One thing I think would be worth a look is adjusting wide receiver WPA by some weight based on QB quality.  Note that Randall Cobb and Jordy Nelson of the Packers have high WPA, ranked 6th and 7th respectively, but they have the privilege of having Rodgers as their QB.  Remember, as the quote above notes, WPA cannot separate an individual’s success from a teammate’s success.  This raises some interesting questions for me that require further inquiry.

> table(receivers$cluster, receivers$playoffs)
     n  y
  1  3 10
  2  4  2
  3  7  4
  4  4  0
  5  3  3

In closing the final blog of the year, I must make some predictions for the College Football Playoff.  I hate to say it, but I think Alabama will roll over Ohio State.  In the Rose Bowl, FSU comes from behind to win...again!  I’d really like to see the Ducks win it all, but I just don’t see their defense being of the quality to stop Winston when it counts, which will be in the fourth quarter.  ‘Bama has that defense, well, the defensive line and backers anyway.  Therefore, I have to give the Crimson Tide the nod in the championship.  The news is not all bad.  Nebraska finally let go of Bo Pelini.  I was ecstatic about his hire, but the paucity of top-notch recruits finally manifested itself in perpetual high-level mediocrity.  His best years were with Callahan’s recruits, Ndamukong Suh among many others.  They should have hired Paul Johnson from Georgia Tech; at least it would have been fun and somewhat nostalgic to watch Husker football again.



Saturday, November 15, 2014

Causality in Time Series - A look at 2 R packages (CausalImpact and Changepoint)

My first blog post on

Tuesday, August 5, 2014

Results of the Readers' Survey

First of all, let me say “Thank You” to all of the 357 people who completed the survey. I was hoping for 100, so needless to say the response blew away my expectations. This endeavor seems like a worthwhile effort to do once a year. Next year I will refine the questionnaire based on what I've learned. Perhaps this can evolve into more of a report on the state of the R community and less of an exercise in determining what the focus of my blog should be.

The pleasant diversions of summer and the exigencies of work have precluded any effort on my part to work on the blog, but indeed there is no rest for the wicked and I feel compelled to get this posted.

As a review, the questionnaire consisted of just 13 questions:
  • Age
  • Gender
  • Education Level
  • Employment Status
  • Country
  • Years of Experience Using R
  • Ranking of Interest to See Blog Posts on Various Analysis Techniques (7-point Likert Scale):
      • Feature Selection
      • Text Mining
      • Data Visualization
      • Time Series
      • Generalized Additive Models
      • Interactive Maps
  • Favorite Topic (select from a list or write in your preference)

Keep in mind that the questions were designed to see what I should focus on in future posts, not to be an all-inclusive or exhaustive list of topics/techniques.  My intent with this post is not to give a full review of the data, but to hit some highlights.

The bottom line is that the clear winner for what readers want to see in the blog is Data Visualization. The other thing that both surprised and pleased me at the same time, is how global the blog readers are. The survey respondents were from 50 different countries! So, what I want to present here are some simple graphs and yes, visualizations, of the data I collected.

I will start out with the lattice package, move on to ggplot2, and conclude with some static maps. I'm still struggling with my preference between lattice and ggplot2. Maybe a preference is not necessary and we can all learn to coexist? I'm leaning towards lattice for simple summaries, and I think its code is easier overall. However, ggplot2 seems a little more flexible and powerful.
> library(lattice)
> attach(survey)
> table(gender) #why can't base R have better tables?
Female   Male 
    28    329
> barchart(education, main="Education", xlab="Count") #barchart of education level

This basic barchart in lattice needs a couple of improvements, which are simple enough.  I want to change the color of the bars to black and to sort the bars from high to low.  You can use the table() and sort() functions to accomplish it.
> barchart(sort(table(education)), main="Education", xlab="Count", col="black")

There are a number of ways to examine a Likert scale.  It is quite common in market research to treat a Likert scale as continuous data instead of ordinal.  Another technique is to report the “top-two box”, coding the data as a binary variable.  Below, I just look at the frequency bars of each of the questions in one plot, using ggplot2 and a multiplot function I copied from Winston Chang's excellent ebook, Cookbook for R.
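As a quick aside of my own (not part of the original analysis), “top-two box” coding might look like this, assuming the Likert columns such as data.vis are stored as numbers from 1 to 7:

> # flag respondents who rated Data Visualization a 6 or 7
> survey$data.vis.t2b = ifelse(survey$data.vis >= 6, 1, 0)
> prop.table(table(survey$data.vis.t2b))  # share in the top two boxes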
You can easily build a multiplot, first creating the individual plots:
> library(ggplot2)

> # note: the column name for the feature selection question is assumed here
> p1 = ggplot(survey, aes(x=feat.select)) + geom_bar(binwidth=0.5) + ggtitle("Feature Selection") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p2 = ggplot(survey, aes(x=text.mine)) + geom_bar(binwidth=0.5) + ggtitle("Text Mining") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p3 = ggplot(survey, aes(x=data.vis)) + geom_bar(binwidth=0.5, fill="green") + ggtitle("Data Visualization") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p4 = ggplot(survey, aes(x=tseries)) + geom_bar(binwidth=0.5) + ggtitle("Time Series") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p5 = ggplot(survey, aes(x=gam)) + geom_bar(binwidth=0.5) + ggtitle("Generalized Additive Models") + theme(axis.title.x = element_blank()) + ylim(0,225)
> p6 = ggplot(survey, aes(x=imaps)) + geom_bar(binwidth=0.5) + ggtitle("Interactive Maps") + theme(axis.title.x = element_blank()) + ylim(0,225)
Now, here is the function to create a multiplot:
> multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
+ require(grid)
+ # Make a list from the ... arguments and plotlist
+ plots <- c(list(...), plotlist)
+ numPlots = length(plots)
+ # If layout is NULL, then use 'cols' to determine layout
+ if (is.null(layout)) {
+ # Make the panel
+ # ncol: Number of columns of plots
+ # nrow: Number of rows needed, calculated from # of cols
+ layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
+ ncol = cols, nrow = ceiling(numPlots/cols))
+ }
+ if (numPlots==1) {
+ print(plots[[1]])
+ } else {
+ # Set up the page
+ grid.newpage()
+ pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
+ # Make each plot, in the correct location
+ for (i in 1:numPlots) {
+ # Get the i,j matrix positions of the regions that contain this subplot
+ matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
+ print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
+ layout.pos.col = matchidx$col))
+ }
+ }
+ }
And now we can compare the six variables in one plot with one line of code, with data visualization the clear winner.
> multiplot(p1, p2, p3, p4, p5, p6, cols=3)

The last thing I will cover in this post is how to do maps of the data on a global scale and compare two packages: rworldmap and googleVis.
> library(rworldmap)
> #prepare the country data 
> df = data.frame(table(country))
> #join data to map
> join = joinCountryData2Map(df, joinCode="NAME",  nameJoinColumn="country", suggestForFailedCodes=TRUE) #one country observation was missing
> par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i") #enables a proper plot
This code takes the log of the observations; without the log values, the resulting map was of no value.  It also adjusts the size of the legend, as the default seems too large in comparison with the map.
> map = mapCountryData( join, nameColumnToPlot="Freq", catMethod="logFixedWidth", addLegend=FALSE)
> do.call(addMapLegend, c(map, legendWidth=0.5, legendMar = 2))

Now, let's try googleVis.  This will produce the plot in HTML and it is also interactive.  That is, you can hover your mouse over a country and get the actual counts.
> library(googleVis)
> G1 <- gvisGeoMap(df, locationvar='country', numvar='Freq')  # display options, if any, omitted
> plot(G1)

That's it for now.  I absolutely love working with data visualization on maps and will continue to try and incorporate them in future analysis.  I can't wait to brutally overindulge and sink my teeth into the ggmap package.  Until next time, Mahalo.

Saturday, June 28, 2014

R Blog Readers Survey

I've almost achieved my goal of a minimum of 100 quality responses.  I will keep the survey open for a couple more days, so if you haven't responded, please feel free to have a crack at it.