**“Statistics are used much like a drunk uses a lamppost: for support, not illumination.”**
Recently at work I stumbled into a classic example of
Simpson's Paradox (the Yule-Simpson effect). Many people are not familiar with this
phenomenon, so I thought I would provide a brief example to help shed some light
on it. If you are not aware of the problem of confounding or
"lurking" variables in your analysis, you can box yourself into an
analytical corner. To dig deeper into what is happening in a Simpson's
Paradox, see the Wikipedia article linked below.

In short what is happening is that a trend or association
that appears between two different variables reverses when a third variable is
included. You will stumble into or at
least need to be on the lookout for this effect of spurious correlation when
you have unbalanced group sizes, like you often have using observational data.
This is what happened in my recent discovery.
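Before we get to the real data, here is a minimal made-up illustration (the counts are invented purely to show the mechanism): within each of two toy departments women are admitted at the higher rate, yet when the departments are pooled the comparison flips.

```r
# Invented counts for two departments, chosen purely to force a reversal
toy <- data.frame(
  dpt = c("1", "1", "2", "2"),
  gen = c("f", "m", "f", "m"),
  app = c(10, 100, 100, 10),  # applications
  adm = c(8, 70, 20, 1)       # admissions
)

# Within each department the female rate is higher:
# dept 1: 0.80 vs 0.70; dept 2: 0.20 vs 0.10
toy$rate <- toy$adm / toy$app

# Aggregated over departments, the direction flips:
# f: 28/110 vs m: 71/110
aggregate(cbind(adm, app) ~ gen, data = toy, FUN = sum)
```

The trick is that the groups are unbalanced: most of the women's applications land in the low-admission department, so the pooled rate is dragged down even though women do better in every department.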

In the example below, let's have a look at the UC Berkeley
admissions data, or at least the portion of it broken out by department, as provided in the Wikipedia
article: https://en.wikipedia.org/wiki/Simpson%27s_paradox#cite_note-Bickel-11

The data contains the number of applications and admissions by gender for six departments of the UC Berkeley graduate school. Again, this is just a portion of the full data, focusing on the largest departments.

dpt = c("A","B","C","D","E","F","A","B","C","D","E","F")

app = c(825,560,325,417,191,272,108,25,593,375,393,341)

adm = c(512,353,120,138,53,16,89,17,202,131,94,24)

gen = c("m","m","m","m","m","m","f","f","f","f","f","f")

df = cbind.data.frame(dpt,app,adm,gen)

str(df)

## 'data.frame': 12 obs. of 4 variables:
## $ dpt: Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6 1 2 3 4 ...
## $ app: num 825 560 325 417 191 272 108 25 593 375 ...
## $ adm: num 512 353 120 138 53 16 89 17 202 131 ...
## $ gen: Factor w/ 2 levels "f","m": 2 2 2 2 2 2 1 1 1 1 ...

There are a number of ways to demonstrate the
effect, but I thought I would give it a go using dplyr. First, let's group the
data by gender (gen) and then look at the overall admission rate for each group.

library(dplyr)

by_gen = group_by(df, gen)

summarize(by_gen, adm_rate=sum(adm)/sum(app))

## Source: local data frame [2 x 2]
##
##   gen  adm_rate
## 1   f 0.3035422
## 2   m 0.4602317

Clearly there is a huge gender bias problem in graduate
school admissions at UC Berkeley and it is time to round up a herd of lawyers.
On a side note, what is the best way to save a drowning lawyer? It's easy, just
take your foot off their head.

We can even provide a statistical test to
strengthen the assertion of bias. In R, a proportions test can be done via the
prop.test() function, inputting the number of "hits" and the number of
"trials".

summarize(by_gen, sum(adm))

## Source: local data frame [2 x 2]
##
##   gen sum(adm)
## 1   f      557
## 2   m     1192

summarize(by_gen, sum(app))

## Source: local data frame [2 x 2]
##
##   gen sum(app)
## 1   f     1835
## 2   m     2590

prop.test(x=c(557,1192), n=c(1835,2590))

##
##  2-sample test for equality of proportions with continuity
##  correction
##
## data:  c(557, 1192) out of c(1835, 2590)
## X-squared = 109.67, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1856333 -0.1277456
## sample estimates:
##    prop 1    prop 2
## 0.3035422 0.4602317

We see in the output a high level of statistical
significance and a confidence interval spanning a difference of roughly 13 to 19
percentage points. However, beware of the analyst proclaiming truths from a 2x2 table
of observational data! In this instance, the confounding variable is the
department (dpt).
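One quick way to see the confounding at work is to cross-tabulate application volume by department and gender (a sketch that rebuilds the vectors from above so it runs on its own):

```r
# Same raw data as above, rebuilt so this snippet is standalone
dpt <- c("A","B","C","D","E","F","A","B","C","D","E","F")
app <- c(825,560,325,417,191,272,108,25,593,375,393,341)
gen <- c("m","m","m","m","m","m","f","f","f","f","f","f")

# Cross-tabulate total applications by department and gender
xtabs(app ~ dpt + gen)
# Men's applications pile into departments A and B,
# while women's are concentrated in C through F
```

Notice how lopsided the table is; that imbalance, not a uniform bias in each department, is what drives the aggregate gap.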

In order to have a look at the data by gender and by
department, we will once again use the group_by() function, then calculate the
admission rates and finally turn it from "long" to "wide"
format.

by_dpt = group_by(df, dpt,gen)

df2 = as.data.frame(summarize(by_dpt, adm_rate=sum(adm)/sum(app)))

df2


##    dpt gen   adm_rate
## 1    A   f 0.82407407
## 2    A   m 0.62060606
## 3    B   f 0.68000000
## 4    B   m 0.63035714
## 5    C   f 0.34064081
## 6    C   m 0.36923077
## 7    D   f 0.34933333
## 8    D   m 0.33093525
## 9    E   f 0.23918575
## 10   E   m 0.27748691
## 11   F   f 0.07038123
## 12   F   m 0.05882353

library(reshape2)

df2_wide = dcast(df2, dpt~gen, value.var="adm_rate")

df2_wide


##   dpt          f          m
## 1   A 0.82407407 0.62060606
## 2   B 0.68000000 0.63035714
## 3   C 0.34064081 0.36923077
## 4   D 0.34933333 0.33093525
## 5   E 0.23918575 0.27748691
## 6   F 0.07038123 0.05882353
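As a sanity check on the reversal, the same prop.test() can be rerun within a single department; here is a sketch using department A's counts from the raw data above (89 of 108 female applicants admitted versus 512 of 825 male):

```r
# Department A only: women 89/108 admitted, men 512/825
prop.test(x = c(89, 512), n = c(108, 825))
# The estimated proportions come out near 0.82 (f) and 0.62 (m),
# so within department A the gap runs in women's favor
```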

With the inclusion of department we now see that females
actually have higher admission rates in four of the six departments (A, B, D, F).
How can this be? It comes down to both the rates and the volume of applications:
women applied overwhelmingly to the departments with low admission rates (C through F),
while men applied mostly to the high-admission departments A and B. Again, the groups
were highly unbalanced. Alright, enough of this, as it is now time to focus on something
important: Ohio State is squaring off against Notre Dame, and the NHL Winter Classic is on!
