Translate

Friday, January 1, 2016

Simpson's Paradox

“Statistics are used much like a drunk uses a lamppost: for support, not illumination.”
A.E. Housman (commonly attributed to Andrew Lang)


Recently at work I stumbled into a classic example of Simpson's Paradox (Yule-Simpson Effect). Many people are not familiar with this phenomena, so I thought I would provide a brief example to help shed some light on it. If you are not aware of the problem of confounding or "lurking" variables in your analysis, you can box yourself into an analytical corner. To further understand what is happening in a Simpson's Paradox check out the following links:



In short what is happening is that a trend or association that appears between two different variables reverses when a third variable is included.  You will stumble into or at least need to be on the lookout for this effect of spurious correlation when you have unbalanced group sizes, like you often have using observational data. This is what happened in my recent discovery.
In the example below, let's have a look at the UC Berkeley data, or at least the portion of it by Department, as provided in a Wikipedia article. https://en.wikipedia.org/wiki/Simpson%27s_paradox#cite_note-Bickel-11

What the data we explore contains is the number of applications and admissions by gender to six different graduate schools. Again, this is just a portion of the data, focusing in on the largest departments.

dpt = c("A","B","C","D","E","F","A","B","C","D","E","F")
app = c(825,560,325,417,191,272,108,25,593,375,393,341)
adm = c(512,353,120,138,53,16,89,17,202,131,94,24)
gen = c("m","m","m","m","m","m","f","f","f","f","f","f")
df = cbind.data.frame(dpt,app,adm,gen)
str(df)


## 'data.frame':    12 obs. of  4 variables:
##  $ dpt: Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6 1 2 3 4 ...
##  $ app: num  825 560 325 417 191 272 108 25 593 375 ...
##  $ adm: num  512 353 120 138 53 16 89 17 202 131 ...
##  $ gen: Factor w/ 2 levels "f","m": 2 2 2 2 2 2 1 1 1 1 ...

There are a number of ways to demonstrate the effect, but I thought I would give it a go using dplyr. First, let's group the data by gender (gen) then look at their overall admission rate.

library(dplyr)
by_gen = group_by(df, gen)
summarize(by_gen, adm_rate=sum(adm)/sum(app))
## Source: local data frame [2 x 2]
##
##   gen    adm_rate
## 1   f   0.3035422
## 2   m 0.4602317

Clearly there is a huge gender bias problem in graduate school admissions at UC Berkeley and it is time to round up a herd of lawyers. On a side note, what is the best way to save a drowning lawyer? It's easy, just take your foot off their head.

We can even provide a statistical test to strengthen the assertion of bias. In R, a proportions test can be done via the prop.test() function, inputting the number of "hits" and the number of "trials".

summarize(by_gen, sum(adm))
## Source: local data frame [2 x 2]
##
##   gen sum(adm)
## 1   f         557
## 2   m     1192

summarize(by_gen, sum(app))
## Source: local data frame [2 x 2]
##
##   gen sum(app)
## 1   f       1835
## 2   m     2590

prop.test(x=c(557,1192), n=c(1835,2590))
##
##  2-sample test for equality of proportions with continuity
##  correction
##
## data:  c(557, 1192) out of c(1835, 2590)
## X-squared = 109.67, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1856333 -0.1277456
## sample estimates:
##        prop 1       prop 2
## 0.3035422 0.4602317


We see in the output a high level of statistical significance and a confidence interval with a difference of roughly 13 to 19 percent. However, beware of the analyst proclaiming truths using a 2x2 table with observational data! In this instance, the confounding variable is the department (dpt).

In order to have a look at the data by gender and by department, we will once again use the group_by() function, then calculate the admission rates and finally turn it from "long" to "wide" format.

by_dpt = group_by(df, dpt,gen)
df2 = as.data.frame(summarize(by_dpt, adm_rate=sum(adm)/sum(app)))
df2
##    dpt  gen         adm_rate
## 1     A    f     0.82407407
## 2     A   m    0.62060606
## 3     B    f     0.68000000
## 4     B   m    0.63035714
## 5     C    f     0.34064081
## 6     C   m    0.36923077
## 7     D    f     0.34933333
## 8     D   m    0.33093525
## 9     E    f      0.23918575
## 10   E   m     0.27748691
## 11   F    f      0.07038123
## 12   F   m     0.05882353

library(reshape2)
df2_wide = dcast(df2, dpt~gen, value.var="adm_rate")
df2_wide
##   dpt                  f               m
## 1   A  0.82407407  0.62060606
## 2   B  0.68000000  0.63035714
## 3   C  0.34064081  0.36923077
## 4   D  0.34933333  0.33093525
## 5   E  0.23918575   0.27748691
## 6   F  0.07038123   0.05882353

With the inclusion of department we now see that females actually have higher admission rates in four of the six departments (A,B,D,F). How can this be? It had to do with rates and the volume of admission to the various departments. Again, the groups were highly unbalanced. Alright, enough of this as it is now time to focus on something important as Ohio State is squaring off against Notre Dame and the NHL Winter Classic!

1 comment:

  1. Good points and important for all to consider then analyzing any data set, big or otherwise.

    ReplyDelete