Translate

Monday, March 17, 2014

MoneyPuck - Best subsets regression of NHL teams

Spring is at hand and it is a time of renewal, March Madness and to settle scores in the NHL.  There are many scores to be settled: Flyers vs. Penguins, Blackhawks vs. Red Wings, Leafs vs. Habs and pretty much everyone else vs. the Bruins.  Like any fire and brimstone hockey fan, clutching my old testament (a well-worn VHS copy of Slapshot), I will be watching with passionate interest as the NHL unleashes a hellish intensity as it comes down the home stretch.  What does it take to win?  I stumbled upon team statistics on nhl.com and decided to have a look with some basic linear regression.  I created a simple .csv file of team stats with 24 variables for each of the 30 teams as of the morning of March 16.

 [1]  "Team"                   "gp"                     "wins"        
 [4]  "losses"                  "ot_losses"           "points"      
 [7]  "reg_ot"                 "home_row"         "road_row"    
[10]  "point_percent"     "g_game"             "ga_game"      
[13]  "fiveon5_ratio"      "play_percent"     "pk_percent"  
[16]  "shots_game"        "sa_game"           "sfirst_percent"
[19]  "trfirst_percent"    "lfirst_percent"     "lsecond_percent"
[22]  "win_outshoot"     "win_outshot"      "fo_percent"  
[25]  "pim"          

The dependent variable here will be "point_percent", which is the percentage points a team achieves per game played.  Since teams have played different numbers of games and because of the NHL point system (I will not get into that here) it is the best reflection of team performance.  I'll discuss other variables in a minute.

Let's get rid of unneeded variables...

> nhl.sub = nhl.3.16[ ,c(1,10:25)]
> #examine a correlation plot
> require(corrplot)
> nhl.corr = cor(nhl.sub[2:17])
> corrplot(nhl.corr, method="ellipse")


        

Setting aside goals scored and allowed per game (g_game, ga_game), we see that we have high correlation with the ratio of goals scored versus opponents when teams are at full strength (fiveon5_ratio), winning percentage when scoring first (sfirst_percent), winning when outshooting an opponent (win_outshoot) and when outshot (win_outshot).  I was surprised that penalty minutes per game (pim) was not correlated with anything.

> #re-make nhl.sub with no goals or goals allowed per game
> nhl.sub = nhl.3.16[ ,c(10,13:25)]
> require(leaps)
> #use the leaps package for best subsets
> #the package defaults to 8 subsets, but I set to 10 for demonstration purposes
> bestsubsets = regsubsets(point_percent~., nhl.sub, nvmax=10)
> #create an object of regsubsets for further evaluation and model selection
> reg.summary = summary(bestsubsets)
> names(reg.summary)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat"
[8] "obj"

> #plot of residual sum of squares
> plot(reg.summary$rss ,xlab="Number of Variables ",ylab="RSS", type="l")
















> #looks like somewhere between 4 and 7 variables would be an optimal model...plot Mallow's Cp next
> plot(reg.summary$cp ,xlab="Number of Variables ",ylab="Cp", type="l")
















> #5 variables? let's confirm...
> which.min(reg.summary$cp )
[1] 5

> plot(bestsubsets ,scale="Cp")



















The way to read this plot is to look at the top (lowest cp) and a black box corresponds to a variable on the x axis.  So here we see that the 5 on 5 ratio, power play percentage, penalty kill percentage, percentage of wins when scoring first and percentage of wins when trailing after the first period make the cut.

> #examine the coefficients
> coef(bestsubsets ,5)
    (Intercept)           fiveon5_ratio       play_percent      pk_percent
   -0.220482142     0.180024350     0.004574047     0.002724056
     sfirst_percent     trfirst_percent
    0.303930780     0.277814196

Here is some additional analysis I did to look at an ANOVA table and examine residuals i.e. confirm our linear model assumptions, I won't bore you with the details:

> nhl.lm = lm(point_percent ~ fiveon5_ratio + play_percent + pk_percent + sfirst_percent + trfirst_percent)
> anova(nhl.lm)
> summary(nhl.lm)
> plot(nhl.lm)

> #create a plot of actual versus predicted with team names for your viewing pleasure
> nhl.3.16["pred"] = NA
> nhl.3.16$pred = predict(nhl.lm)
> require(ggplot2)
> p = ggplot(nhl.3.16, aes(x=pred, y=point_percent, label=Team))
> p + geom_point() + geom_text(size=3, hjust=-.1, vjust=0)


















By the way, the adjusted R-squared was 0.9614.  Interesting how we see the elite teams like St. Louis and Anaheim clustering together and the has-been teams clustering as well.  Teams like Tampa, Washington and Dallas will all be fighting it out for those playoff spots to the glorious and bitter end.

Mahalo,

Cory

Saturday, March 8, 2014

The "Fighting Sioux" Surge

Even a casual fan of North Dakota Hockey will notice that in the era of Coach Dave Hakstol, the team seems to perform better in the second half of the season than the first.  For the rabid fans of the team like myself, it has become a horrible and interesting fact of life.  By mid-December we are inevitably calling for Hakstol's head on a pike only to be whipped into a blood-thirsty frenzy as the team dominates the ice, slaughtering opponents with buzz-saw precision. The WCHA tourney championships and trips to the Frozen Four will attest to that.  Let's not get into the team's Frozen Four issues...just too painful to write about, too many emotional scars.

On February 28th, , Dave Berger posted on his siouxsports.com blog an excellent analysis of the surge, "The Second-Half Surge: Math or Myth?"

http://blog.siouxsports.com/2014/02/28/the-second-half-surge-math-or-myth-2/

Using the data Dave provided in his blog, I couldn't help but subject it to some analysis in R.  No fancy econometrics, big data or machine learning here, just some good old fashioned statistical tests on a small data set.  The data starts in the 2004-05 season and goes through last year.  In 7 of the 9 years, the team has a better winning percentage in the second half.

year            w1   l1 t1    g1 win1 w2 l2 t2    g2    win2
1 2004-05 13 7 2    22    0.591    12    8    3 23    0.522
2 2005-06 13 8 1    22    0.591 16    8    0    24    0.667
3 2006-07      9 10 1 20    0.450 15 4    4    23    0.652
4 2007-08 9 7 1 17    0.529    19 4 3 26    0.731
5 2008-09 9 10  1    20    0.450    15    5 3    23 0.652
6 2009-10      9    6 3    18 0.500 16 7 2    25    0.640
7 2010-11 14 5    2 21 0.667    18 3    1    22    0.818
8 2011-12    10    8    2    20 0.500    16    5    1    22    0.727
9 2012-13    10    5    3 18    0.556 12    8    4    24    0.500

The variables are 1st half wins (w1), losses (l1), ties (t1), total games (g1) and win percentage (win1), along with the corresponding data from the 2nd half of each season.

> #boxplots of wins to visualize the difference
> boxplot(win1,win2, main="Second Half Surge", ylab ="Win Percentage", names=c("1st Half", "2nd Half"))


> #paired t-test to compare the means; use a paired test because you have two measurements on the same "subject"
> t.test(win1,win2, paired=T)

Paired t-test

data:  win1 and win2
t = -3.1795, df = 8, p-value = 0.01301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.20607414 -0.03281474
sample estimates:
mean of the differences
             -0.1194444

> #p-value is less than .05, so we have achieved statistical significance; note the average difference in the means is about 12%

> #proportions test of all seasons
> #create objects of wins and total games
> w1.total = sum(w1)
> g1.total = sum(g1)
> w2.total = sum(w2)
> g2.total = sum(g2)

> #now run the proportions test with vectors, using the objects above
> prop.test(c(w1.total,w2.total), c(g1.total,g2.total))

2-sample test for equality of proportions with continuity
correction

data:  c(w1.total, w2.total) out of c(g1.total, g2.total)
X-squared = 4.9931, df = 1, p-value = 0.02545
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.21872805 -0.01394102
sample estimates:
   prop 1    prop 2
0.5393258 0.6556604

Another significant p-value.  North Dakota is winning almost 54% of their games in the 1st half of a season, but nearly 66% in the second half.  Take out those gut-wrenching Frozen Four losses and the results would be quite impressive.

On a side note, this season has been no different.  I made the trip last October 19th to Oxford, OH in full Sioux regalia only to watch then number 1 ranked Miami (OH) mercilessly shred North Dakota.  A couple of weeks ago, we returned the favor, ripping the lungs out of 'em on consecutive nights.  We are in the hunt once again and the pulse quickens as we go in for the kill.  I only hope we don't see Boston College at any point in the near future.

CL