
Saturday, July 20, 2013

A Quick and Dirty Guide to Exploratory Data Visualization

One of the things I've noticed while teaching statistical fundamentals or working with colleagues is the lack of focus on first exploring the data visually.  Novices seem to want to jump right in with correlations and statistical tests without getting a "feel" for what they are examining.  The Germans have an appropriate term, I think, in "Fingerspitzengefühl", which literally means "fingertips feeling".  Visualization of the data can provide this, and that is a selling point of using R.  The community comments and packages available are seemingly endless.  My goal in this post is to examine, at a high level, the lattice and vcd packages.  The data set is based on the world's largest metaphor hitting an iceberg; that's right, the Titanic.
I downloaded the data set from kaggle.com, where it is part of one of their competitions.  It consists of the following variables:
survived = did the passenger survive or not
pclass = passenger class
name = passenger name
sex = passenger sex
age = passenger age
sibsp = the number of siblings/spouses aboard
parch = the number of parents/children aboard
ticket = ticket number
fare = passenger fare
cabin = passenger cabin number
embarked = Port of Embarkation; either Cherbourg, Southampton or Queenstown
home.dest = passenger home and eventual destination
The contest is seeking the model that best predicts passenger survival (variable survived), and the website offers several tutorials to get contestants started.  This is an interesting data set, and one I think lends itself to showing the power of simple data visualization.
I've loaded the data into R, calling it titan1, and we can see it consists of 1309 observations of 12 variables.
> str(titan1)
'data.frame':   1309 obs. of  12 variables:
 $ survived : Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass   : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex      : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age      : num  29 0.917 2 30 25 ...
 $ sibsp    : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch    : int  0 2 2 2 2 0 0 0 0 0 ...
 $ fare     : num  211 152 152 152 152 ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
 $ ticket   : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
 $ cabin    : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
 $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
 $ name     : Factor w/ 1307 levels "Abbing, Mr. Anthony",..:

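Notice that embarked is telling us it has 4 levels, one of which is the empty string ("") standing in for missing data.  This is a pesky problem with factors, which took me a while to figure out how to get rid of.  I believe the easiest way to deal with it is at load time: pass stringsAsFactors = FALSE to read.csv() so text columns arrive as plain character strings, or pass na.strings = "" so that empty fields are read as NA in the first place.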
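A minimal sketch of handling this at load time, assuming a small inline CSV standing in for the real file (the column names mirror the data set, but the rows are made up for illustration).  It uses the na.strings argument of read.csv(); stringsAsFactors = TRUE is set explicitly only so the columns stay factors, as in the str() output above.

```r
# Toy CSV standing in for the real file; the rows are invented for illustration
csv_text <- "survived,pclass,embarked
1,1st,S
0,3rd,
1,2nd,C
0,3rd,Q"

# na.strings = "" converts empty fields to NA while the file is being read
toy <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

levels(toy$embarked)      # no empty-string level among the levels
sum(is.na(toy$embarked))  # number of missing ports of embarkation
```

With the empty fields already NA, no clean-up of factor levels is needed afterwards.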
We will delete the missing observations, but first let's get rid of variables I'm not interested in (ticket, cabin, home.dest and name).
> titan2 = titan1[c(-9,-10,-11,-12)] #create a subset by dropping variables
> names(titan2)
[1] "survived" "pclass"   "sex"      "age"      "sibsp"    "parch"    "fare"     "embarked"
> which(titan2$embarked == "")  #find those pesky "" in embarked
[1] 169 285
> levels(titan2$embarked) = c(NA, "C", "Q", "S")  #replace "" with NA, again stringsAsFactors = FALSE is better during data upload
> levels(titan2$embarked)
[1] "C" "Q" "S"

> str(titan2)  #confirm above
'data.frame':   1309 obs. of  8 variables:
 $ survived: Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
 $ fare    : num  211 152 152 152 152 ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1
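Replacing the levels by position works, but it depends on knowing that "" is the first level.  A sketch of a more explicit alternative on a toy factor (the values are invented for illustration, not taken from the data set), using only base R:

```r
# Toy factor with two empty strings standing in for missing ports
f <- factor(c("S", "", "C", "Q", ""))

f[f == ""] <- NA    # mark the empty strings as missing
f <- droplevels(f)  # drop the now-unused "" level

levels(f)           # remaining levels
sum(is.na(f))       # number of values now missing
```

This reads a little more clearly than positional replacement and does not break if the level order changes.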

> titan3 = na.omit(titan2)  # delete missing observations

> str(titan3) #structure of the data ready for analysis

'data.frame':   1043 obs. of  8 variables:
 $ survived: Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
 $ fare    : num  211 152 152 152 152 ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:266] 16 38 41 47 60 70 71 75 81 107 ...
  .. ..- attr(*, "names")= chr [1:266] "16" "38" "41" "47" ...
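Before dropping rows it can help to see which variables the missing values come from.  A small sketch on a made-up data frame (the values are invented; the same calls should apply to titan2), using base R only:

```r
# Toy data frame; values are invented for illustration
toy <- data.frame(
  age      = c(29, NA, 2, NA, 25),
  fare     = c(211, 152, NA, 152, 152),
  embarked = factor(c("S", "S", "C", NA, "Q"))
)

colSums(is.na(toy))        # missing values per column
sum(!complete.cases(toy))  # rows that na.omit() would drop
nrow(na.omit(toy))         # rows that remain
```

If one variable accounts for nearly all of the dropped rows, that is worth knowing before reading too much into the plots.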

OK, I shall now put the lattice package through its paces, putting together a number of trellis plots.
> library(lattice)  #load the package
> trellis.device()  #open a graphics device with trellis (lattice) settings

Trellis plots via lattice are an effective way to display multivariate data.  The code follows the format of...graphtype(formula, data=...).
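To make the formula format concrete, here is a minimal sketch on a made-up data frame (the values are invented; only the column names echo the Titanic data).  In a lattice formula, y ~ x | g means "plot y against x, with one panel per level of g".

```r
library(lattice)

# Toy data; values are invented for illustration
toy <- data.frame(
  age      = c(22, 38, 26, 35, 54, 2, 27, 14),
  survived = factor(c("dead", "survive", "survive", "survive",
                      "dead", "dead", "survive", "survive")),
  pclass   = factor(c("3rd", "1st", "3rd", "1st", "1st", "3rd", "3rd", "2nd"))
)

# age on the y axis, survival on the x axis, one panel per passenger class
p <- bwplot(age ~ survived | pclass, data = toy, layout = c(3, 1))
print(p)  # lattice functions return a "trellis" object; printing draws it
```

At the console the plot is drawn automatically, but inside a function or a loop the explicit print() is needed.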
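Let's do a simple boxplot of age by passenger survival.  (Note that boxplot() is the base R function; the lattice equivalent, bwplot(), appears further down.)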
> boxplot(age~survived, ylab= "Passenger Age",  data=titan3)



> xyplot(pclass~age | survived, data=titan3) #this plot looks at class by age "conditioned" on survival





Looks like the youth in 1st and 2nd class stood a much better chance of survival than those in 3rd class.
> dotplot(survived~age | pclass, data=titan3) #trying dotplot to examine this in a different way




After much trial and error, I find the following plot to be informative.  Females in 1st and 2nd class seemed to have a much better chance of survival than any other group.  Of the males, only 1st and 2nd class youth stood much of a chance.
> xyplot(age~survived | sex * pclass, data=titan3)



> bwplot(age~survived | pclass, data=titan3, layout=c(3,1))  #apparent visual confirmation; you could condition by sex also



I really like mosaic plots (leftover from my JMP days) for looking at nominal data.  You can use mosaicplot(), which comes with the standard R installation.

> mosaicplot(~survived + pclass, data=titan3, color=TRUE)  #2 factors examined in a mosaic plot; you can change colors e.g. color=3:4 etc.
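A mosaic plot is a picture of a contingency table: each tile's area is proportional to a cell count.  As a sketch, the table can be built first with xtabs() and then passed to mosaicplot(), which also accepts a table.  The data frame below is made up for illustration.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  survived = factor(c("dead", "survive", "survive", "dead",
                      "dead", "survive", "dead", "survive")),
  pclass   = factor(c("3rd", "1st", "2nd", "3rd",
                      "1st", "1st", "2nd", "3rd"))
)

# Cross-tabulate the two factors; these counts drive the tile areas
tab <- xtabs(~ survived + pclass, data = toy)
tab

# mosaicplot() accepts the table directly
mosaicplot(tab, color = TRUE, main = "Toy mosaic")
```

Looking at the table next to the plot is a quick way to check that a tile that looks large really does hold many passengers.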



> library(vcd) #trying something new in mosaic plots by using the vcd package
> mosaic(~ survived + sex | pclass, data=titan3, main = "Titanic Survival", shade = TRUE, legend = TRUE) #plot conditioned by pclass; much better 'eh?



> spine(survived~age, data=titan3, breaks=8) #spine chart in vcd; 1 factor and 1 numeric variable, with age cut into roughly 8 intervals via the breaks argument
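The idea behind the spine chart can be sketched in base R: cut the numeric variable into intervals, then look at the proportion of each outcome within every interval.  The bar heights in the chart correspond to these row proportions, and the bar widths to the number of observations per interval.  This is only an illustration of the idea on made-up data with hand-picked break points, not a reproduction of what vcd does internally.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  age      = c(5, 12, 25, 33, 47, 58, 8, 29, 41, 63),
  survived = factor(c("survive", "survive", "dead", "survive", "dead",
                      "dead", "survive", "dead", "survive", "dead"))
)

# Cut age into intervals (break points chosen by hand for this sketch)
toy$age_bin <- cut(toy$age, breaks = c(0, 20, 40, 60, 80))

# Counts per interval and outcome, then proportions within each interval
tab   <- table(toy$age_bin, toy$survived)
props <- prop.table(tab, margin = 1)
props
```

Intervals with very few observations give unstable proportions, which is one reason to experiment with the number of breaks.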



This just scratches the surface (I haven't used ggplot yet).  I have tried generalized pairs plots using GGally, but haven't found them that insightful once you get over 4 or 5 variables.  I would appreciate any further recommendations and tips/tricks on visualization with R.


T.D. Meister
