Even a casual fan of North Dakota Hockey will notice that in the era of Coach Dave Hakstol, the team seems to perform better in the second half of the season than the first. For the rabid fans of the team like myself, it has become a horrible and interesting fact of life. By mid-December we are inevitably calling for Hakstol's head on a pike only to be whipped into a blood-thirsty frenzy as the team dominates the ice, slaughtering opponents with buzz-saw precision. The WCHA tourney championships and trips to the Frozen Four will attest to that. Let's not get into the team's Frozen Four issues...just too painful to write about, too many emotional scars.

On February 28th, Dave Berger posted an excellent analysis of the surge on his siouxsports.com blog, "The Second-Half Surge: Math or Myth?"

http://blog.siouxsports.com/2014/02/28/the-second-half-surge-math-or-myth-2/

I couldn't help but subject the data Dave provided in his blog to some analysis in R. No fancy econometrics, big data, or machine learning here, just some good old-fashioned statistical tests on a small data set. The data start with the 2004-05 season and run through last year. In 7 of the 9 seasons, the team has a better winning percentage in the second half.

     year w1 l1 t1 g1  win1 w2 l2 t2 g2  win2
1 2004-05 13  7  2 22 0.591 12  8  3 23 0.522
2 2005-06 13  8  1 22 0.591 16  8  0 24 0.667
3 2006-07  9 10  1 20 0.450 15  4  4 23 0.652
4 2007-08  9  7  1 17 0.529 19  4  3 26 0.731
5 2008-09  9 10  1 20 0.450 15  5  3 23 0.652
6 2009-10  9  6  3 18 0.500 16  7  2 25 0.640
7 2010-11 14  5  2 21 0.667 18  3  1 22 0.818
8 2011-12 10  8  2 20 0.500 16  5  1 22 0.727
9 2012-13 10  5  3 18 0.556 12  8  4 24 0.500

The variables are 1st half wins (w1), losses (l1), ties (t1), total games (g1) and win percentage (win1, which is wins divided by total games, so a tie counts as a non-win), along with the corresponding data from the 2nd half of each season.
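If you want to follow along without a data file, here is a minimal sketch that types the table into R. The data frame name `hockey` is my own choice, not something from Dave's post, and I am assuming the percentages are simply wins over games rounded to three places, which matches the table.

```r
# Season records typed in from the table above
hockey <- data.frame(
  year = c("2004-05", "2005-06", "2006-07", "2007-08", "2008-09",
           "2009-10", "2010-11", "2011-12", "2012-13"),
  w1 = c(13, 13, 9, 9, 9, 9, 14, 10, 10),
  l1 = c(7, 8, 10, 7, 10, 6, 5, 8, 5),
  t1 = c(2, 1, 1, 1, 1, 3, 2, 2, 3),
  w2 = c(12, 16, 15, 19, 15, 16, 18, 16, 12),
  l2 = c(8, 8, 4, 4, 5, 7, 3, 5, 8),
  t2 = c(3, 0, 4, 3, 3, 2, 1, 1, 4)
)

# Games played and win percentage for each half (ties count as non-wins)
hockey$g1   <- with(hockey, w1 + l1 + t1)
hockey$g2   <- with(hockey, w2 + l2 + t2)
hockey$win1 <- round(hockey$w1 / hockey$g1, 3)
hockey$win2 <- round(hockey$w2 / hockey$g2, 3)

# Put the columns back in the same order as the table
hockey <- hockey[, c("year", "w1", "l1", "t1", "g1", "win1",
                     "w2", "l2", "t2", "g2", "win2")]

# Number of seasons with a better second half
better_second_half <- sum(hockey$win2 > hockey$win1)
print(hockey)
print(better_second_half)
```

With the data frame in place, `attach(hockey)` or `with(hockey, ...)` makes bare column names such as `win1` and `win2` available to the tests.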

> #boxplots of wins to visualize the difference

> boxplot(win1,win2, main="Second Half Surge", ylab ="Win Percentage", names=c("1st Half", "2nd Half"))

> #paired t-test to compare the means; use a paired test because you have two measurements on the same "subject"

> t.test(win1, win2, paired=TRUE)

        Paired t-test

data:  win1 and win2
t = -3.1795, df = 8, p-value = 0.01301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.20607414 -0.03281474
sample estimates:
mean of the differences 
             -0.1194444 

> #p-value is less than .05, so we have achieved statistical significance; note the mean difference is about 12 percentage points
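For anyone curious about what the paired test does under the hood, here is a rough sketch that rebuilds the statistic by hand from the season-by-season differences. It assumes `win1` and `win2` hold the rounded percentages from the table; the names `d`, `t_stat`, `p_value`, and `ci` are mine.

```r
# Rounded win percentages from the table, first and second half
win1 <- c(0.591, 0.591, 0.450, 0.529, 0.450, 0.500, 0.667, 0.500, 0.556)
win2 <- c(0.522, 0.667, 0.652, 0.731, 0.652, 0.640, 0.818, 0.727, 0.500)

# A paired t-test is a one-sample t-test on the within-season differences
d  <- win1 - win2
n  <- length(d)
se <- sd(d) / sqrt(n)

t_stat  <- mean(d) / se                             # test statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)         # two-sided p-value
ci      <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% interval

print(c(t = t_stat, p = p_value))
print(ci)
```

The pieces should line up with the t.test output: the same mean difference, t statistic, and confidence interval, just computed one step at a time.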

> #proportions test of all seasons

> #create objects of wins and total games

> w1.total = sum(w1)

> g1.total = sum(g1)

> w2.total = sum(w2)

> g2.total = sum(g2)

> #now run the proportions test with vectors, using the objects above

> prop.test(c(w1.total,w2.total), c(g1.total,g2.total))

        2-sample test for equality of proportions with continuity correction

data:  c(w1.total, w2.total) out of c(g1.total, g2.total)
X-squared = 4.9931, df = 1, p-value = 0.02545
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.21872805 -0.01394102
sample estimates:
   prop 1    prop 2 
0.5393258 0.6556604 

Another significant p-value. North Dakota is winning almost 54% of their games in the 1st half of a season, but nearly 66% in the second half. Take out those gut-wrenching Frozen Four losses and the results would be quite impressive.
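As a sanity check on those numbers, here is a small sketch of the same comparison done by hand as a 2x2 chi-squared test with the continuity correction. The totals are the column sums from the table (96 wins in 178 first-half games, 139 wins in 212 second-half games); the object names `tab`, `expected`, `x2`, and `p_value` are my own.

```r
# Wins and games played, summed over all nine seasons (first half, second half)
wins  <- c(96, 139)
games <- c(178, 212)

# 2x2 table: rows are win / non-win, columns are the two halves
tab <- rbind(win = wins, nonwin = games - wins)
n   <- sum(tab)

# Expected counts if the win rate were the same in both halves
expected <- outer(rowSums(tab), colSums(tab)) / n

# Chi-squared statistic with the continuity correction (subtract 0.5 from |O - E|)
x2      <- sum((abs(tab - expected) - 0.5)^2 / expected)
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

print(wins / games)   # win proportion in each half
print(c(X2 = x2, p = p_value))
```

Keep in mind that ties sit in the non-win row here, the same way they do in the win percentages.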

On a side note, this season has been no different. I made the trip last October 19th to Oxford, OH, in full Sioux regalia only to watch then-No. 1 Miami (OH) mercilessly shred North Dakota. A couple of weeks ago, we returned the favor, ripping the lungs out of 'em on consecutive nights. We are in the hunt once again and the pulse quickens as we go in for the kill. I only hope we don't see Boston College at any point in the near future.

CL

You might also tackle this by logistic regression. Here is a suggestion, starting with reading in the data and massaging it into a convenient form:

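hock <- read.csv("hockey.csv")  ## assuming the data is in .csv form
## split out the two halves and give them the same column names
H1 <- subset(hock, select = year:win1)
H2 <- subset(hock, select = c(year, w2:win2))
names(H2) <- names(H1)
## stack into long form: one row per season per half
Hock <- rbind(cbind(Half = "First", H1),
              cbind(Half = "Second", H2))
## binomial GLM on wins vs. losses (ties are left out of the response)
fm <- glm(cbind(w1, l1) ~ Half + year, family = binomial, data = Hock)
summary(fm)  ## output omitted here
library(MASS)  ## provides dropterm()
dropterm(fm, test = "Chisq")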

## Single term deletions
##
## Model:
## cbind(w1, l1) ~ Half + year
##        Df Deviance    AIC    LRT  Pr(Chi)
## <none>      6.9336 85.832
## Half    1  14.0982 90.997 7.1646 0.007436
## year    8  12.1759 75.075 5.2423 0.731397

___

Looks clear that Half is important and year is not - but need to follow up on how big the effect is.
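One possible way to follow up on the size of the effect is to turn the Half coefficient into an odds ratio. The sketch below is only a suggestion under a few assumptions: it types the wins and losses in from the table rather than reading a file, leaves ties out of the response, uses my own names (`long`, `odds_ratio`, `ci`), and uses a simple Wald interval from `confint.default` rather than a profile-likelihood one.

```r
# Wins and losses per season, first and second half (from the table in the post)
year <- c("2004-05", "2005-06", "2006-07", "2007-08", "2008-09",
          "2009-10", "2010-11", "2011-12", "2012-13")
w1 <- c(13, 13, 9, 9, 9, 9, 14, 10, 10)
l1 <- c(7, 8, 10, 7, 10, 6, 5, 8, 5)
w2 <- c(12, 16, 15, 19, 15, 16, 18, 16, 12)
l2 <- c(8, 8, 4, 4, 5, 7, 3, 5, 8)

# Long form: one row per season per half
long <- data.frame(
  Half = factor(rep(c("First", "Second"), each = 9)),
  year = factor(rep(year, times = 2)),
  w = c(w1, w2),
  l = c(l1, l2)
)

# Same model as above: wins vs. losses on Half, adjusting for season
fm <- glm(cbind(w, l) ~ Half + year, family = binomial, data = long)

# "First" is the reference level, so HalfSecond is the second-half log odds ratio
odds_ratio <- exp(coef(fm)["HalfSecond"])
ci <- exp(confint.default(fm, parm = "HalfSecond"))  # Wald 95% interval

print(odds_ratio)
print(ci)
```

An odds ratio above 1 says the odds of a win (against a loss) are higher in the second half, and the interval gives a rough idea of how much higher.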