
Monday, July 29, 2013

Quandl.com for Time Series Datasets

If you want to dig into time series data, quandl.com is a good choice.  The website claims to offer several million datasets, all of them available for free download.  It also allows you to upload data to the site with an authentication token.
The site says it is easy to get data from its API directly into R, so let's have a crack at it.
My source for this code is Quandl itself:  http://www.quandl.com/help/packages/r
> #install the package
> install.packages("Quandl")
> library(Quandl)
> #search Quandl for a dataset on new home construction in the US
> Quandl.search("new home construction") #default is to display the first 3 results

Construction Employment in New Hampshire
Code: FRED/NHCONS
Desc: Thousands of Persons Seasonally Adjusted, 
Freq: monthly
Cols: Date|Value

Construction Employment in New Mexico
Code: FRED/NMCONS
Desc: Thousands of Persons Seasonally Adjusted, 
Freq: monthly
Cols: Date|Value

Construction Employment in New York
Code: FRED/NYCONS
Desc: Thousands of Persons Seasonally Adjusted, 
Freq: monthly
Cols: Date|Value

#Nothing useful in that output, so let's refine the search

> Quandl.search("Housing Units Completed", source="FRED") #added source here and FRED stands for Federal Reserve

New Privately-Owned Housing Units Completed: Total
Code: FRED/COMPUTSA
Desc: Thousands of Units Seasonally Adjusted Annual Rate, 
Freq: monthly
Cols: Date|Value

New Privately-Owned Housing Units Completed: Total
Code: FRED/COMPUTNSA
Desc: Thousands of Units Not Seasonally Adjusted, 
Freq: monthly
Cols: Date|Value

New Privately-Owned Housing Units Completed in the South Census Region
Code: FRED/COMPUSTSA
Desc: Thousands of Units Seasonally Adjusted Annual Rate, 
Freq: monthly
Cols: Date|Value

# I will settle for total Units Completed at a seasonally adjusted annual rate; the dataset code is "FRED/COMPUTSA"

> Units = Quandl("FRED/COMPUTSA") #download the data
> head(Units)
        Date Value
1 1968-01-01  1257
2 1968-02-01  1174
3 1968-03-01  1323
4 1968-04-01  1328
5 1968-05-01  1367
6 1968-06-01  1184
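
If you would rather work with the series as a time-series object from the start, the Quandl() function's type argument (covered on the same help page) returns a ts, zoo, or xts object instead of a data frame.  A quick sketch:

> Units.ts = Quandl("FRED/COMPUTSA", type="ts")  #same series, returned as a ts object
> plot(Units.ts, ylab="Units Completed (thousands, SAAR)")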


There you have it.  I plan to explore the changepoint package in an upcoming post using this dataset.
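
One footnote on the authentication token mentioned at the top: per Quandl's R documentation, you register the token with the package before uploading data.  The token string below is a hypothetical placeholder:

> Quandl.auth("YOUR_AUTH_TOKEN")  #hypothetical placeholder; use the token from your Quandl account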

Tuesday, July 23, 2013

Postscript to Data Visualization

Much to my chagrin, I realized I forgot to include one of the more interesting features in the lattice package. You can quickly turn a quantitative variable into a set of levels with roughly equal counts.  This provides a nice way of looking at slices of your data in a trellis plot.  These slices are referred to as shingles because they overlap according to your specification.

Here is a simple code example, using the quakes data included with R:

> library(lattice)
> head(quakes)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12
> #build an xyplot of lat and long conditioned on depth
> depth = quakes$depth
> range(depth)
[1]  40 680
> Depth = equal.count(depth, number=6, overlap=.05)
> #with equal.count we have now sliced depth into 6 parts with a 5% overlap
> plot(Depth)
[plot of the Depth shingles]
> xyplot(lat~long | Depth, data=quakes)

Read the panels starting at the bottom left, moving left to right, then continuing with the next row up, again left to right; the shallowest depth interval is in the bottom-left panel and the deepest is in the top right.
[trellis plot: lat ~ long conditioned on Depth]
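
If you would rather read the panels from the top left down, like a table, lattice's as.table argument reverses the panel order.  For example:

> xyplot(lat~long | Depth, data=quakes, as.table=TRUE)  #panels ordered from the top left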
Enjoy.

TDM

Saturday, July 20, 2013

A Quick and Dirty Guide to Exploratory Data Visualization

One of the things I've noticed teaching statistical fundamentals or working with colleagues is the lack of focus on first visually exploring the data.  Novices seem to want to jump right in with correlations and statistical tests without getting a "feel" for what they are examining.  The Germans have an appropriate term, I think, in "Fingerspitzengefühl", which literally means "fingertips feeling".  Visualization of the data can provide this, and that is a selling point of using R.  The community support and packages available are seemingly endless.  My goal in this post is to examine, at a high level, the lattice and vcd packages.  The data set is based on the world's largest metaphor hitting an iceberg;  that's right, the Titanic. 
I downloaded the data set from kaggle.com and it is part of one of their competitions.  It consists of the following variables:
survived = did the passenger survive or not
pclass = passenger class
name = passenger name
sex = passenger sex
age = passenger age
sibsp = the number of siblings/spouses aboard
parch = the number of parents/children aboard
ticket = ticket number
fare = passenger fare
cabin = passenger cabin number
embarked = Port of Embarkation; either Cherbourg, Southampton or Queenstown
home.dest = passenger home and eventual destination
The contest seeks the model that best predicts passenger survival (the survived variable), and the website offers several tutorials to get contestants started.  This is an interesting data set, and one I think lends itself to demonstrating the power of simple data visualization.
I've loaded the data into R, calling it titan1, and we can see it consists of 1309 observations.
> str(titan1)
'data.frame':   1309 obs. of  12 variables:
 $ survived : Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass   : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex      : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age      : num  29 0.917 2 30 25 ...
 $ sibsp    : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch    : int  0 2 2 2 2 0 0 0 0 0 ...
 $ fare     : num  211 152 152 152 152 ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
 $ ticket   : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
 $ cabin    : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
 $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
 $ name     : Factor w/ 1307 levels "Abbing, Mr. Anthony",..:

Notice that embarked reports 4 levels, including missing data ("").  This is a pesky problem with factors, and it took me a while to figure out how to get rid of it.  I believe the easiest way to deal with it is to pass stringsAsFactors = FALSE as an argument when loading the .csv. 
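
For instance, a minimal sketch of that approach (the file name here is hypothetical):

> titan1 = read.csv("titanic.csv", stringsAsFactors=FALSE)  #hypothetical file name
> titan1$embarked[titan1$embarked == ""] = NA  #blank strings become NA
> titan1$embarked = factor(titan1$embarked)  #the factor now has only the 3 real levels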
We will delete the missing observations, but first let's get rid of variables I'm not interested in (ticket, cabin, home.dest and name).
> titan2 = titan1[c(-9,-10,-11,-12)] #create a subset by dropping variables
> names(titan2)
[1] "survived" "pclass"   "sex"      "age"      "sibsp"    "parch"    "fare"     "embarked"
> which(titan2$embarked == "")  #find those pesky "" in embarked
[1] 169 285
> levels(titan2$embarked) = c(NA, "C", "Q", "S")  #replace "" with NA; again, stringsAsFactors = FALSE during import avoids this
> levels(titan2$embarked)
[1] "C" "Q" "S"

> str(titan2)  #confirm above
'data.frame':   1309 obs. of  8 variables:
 $ survived: Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
 $ fare    : num  211 152 152 152 152 ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1

> titan3 = na.omit(titan2)  # delete missing observations

> str(titan3) #structure of the data ready for analysis

'data.frame':   1043 obs. of  8 variables:
 $ survived: Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
 $ fare    : num  211 152 152 152 152 ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:266] 16 38 41 47 60 70 71 75 81 107 ...
  .. ..- attr(*, "names")= chr [1:266] "16" "38" "41" "47" ...

OK, I shall now put the lattice package through its paces, putting together a number of trellis plots.
> library(lattice)  #load the package
> trellis.device()  #opens a trellis-aware graphics device

Trellis plots via lattice are an effective way to display multivariate data.  The code follows the format graphtype(formula, data = ...).
Let's start with a simple boxplot of age by passenger survival (note that boxplot() is base R; lattice's equivalent, bwplot, appears further below).
> boxplot(age~survived, ylab= "Passenger Age",  data=titan3)
[boxplot: passenger age by survival]
> xyplot(pclass~age | survived, data=titan3) #this plot looks at class by age "conditioned" on survival
[xyplot: pclass ~ age conditioned on survived]
Looks like the youth in 1st and 2nd class stood a much better chance of survival than those in 3rd class.
> dotplot(survived~age | pclass, data=titan3) #trying dotplot to examine this in a different way
[dotplot: survived ~ age conditioned on pclass]
After much trial and error I find this plot to be informative.  Females in 1st and 2nd class seemed to have a much better chance of survival than any other group.  Of the males, only 1st and 2nd class youth stood much of a chance.
> xyplot(age~survived | sex * pclass, data=titan3)
[xyplot: age ~ survived conditioned on sex and pclass]
> bwplot(age~survived | pclass, data=titan3, layout=c(3,1))  #apparent visual confirmation; you could condition by sex also

[bwplot: age ~ survived conditioned on pclass]
I really like mosaic plots (a holdover from my JMP days) for looking at nominal data.  You can use what comes with base R.

> mosaicplot(~survived + pclass, data=titan3, color=TRUE)  #2 factors examined in a mosaic plot; you can change colors, e.g. color=3:4 etc.

[mosaicplot: survived by pclass]
> library(vcd) #trying something new in mosaic plots by using the vcd package
> mosaic(~ survived + sex | pclass, data=titan3, main = "Titanic Survival", shade = TRUE, legend = TRUE) #plot conditioned by pclass; much better 'eh?
[vcd mosaic: survived + sex conditioned on pclass]
> spine(survived~age, data=titan3, breaks=8) #spine chart in vcd; 1 factor and 1 numeric variable broken into 8 intervals with the breaks argument

[spine plot: survived ~ age]
This just scratches the surface (I haven't used ggplot2 yet).  I have tried generalized pairs plots using GGally, but haven't found them that insightful once you get past 4 or 5 variables.  I would appreciate any further recommendations and tips/tricks on visualization with R.
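
For what it's worth, here is a minimal GGally sketch on a handful of variables (this assumes the GGally package is installed; the column selection is just an illustration):

> library(GGally)
> ggpairs(titan3[, c("survived", "sex", "age", "fare")])  #generalized pairs plot on 4 variables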


T.D. Meister

Sunday, July 14, 2013

Partial Least Squares Regression in R

Partial Least Squares Regression:

This week I will be doing some consulting around Structural Equation Modeling (SEM) techniques to solve a unique business problem.  We are trying to identify customer preference for various products, and traditional regression is not adequate because of the high dimensionality of the data set along with the multicollinearity of the variables.  PLS is a powerful and effective method for handling these sorts of problematic data sets.

Principal components regression is one option we will explore, but in doing background research I have found that PLS may be a better option.  We will look at both PLS regression and PLS path analysis.  I don't believe traditional SEM will be of value at this point, as we don't have a good feel or theory to make assumptions about the latent structure.  Also, because of the number of variables in the data set, we are stretching SEM techniques to the limit.  An interesting discussion of this limitation can be found in Haenlein, M. & Kaplan, A. (2004), "A Beginner's Guide to Partial Least Squares Analysis", Understanding Statistics, 3(4), 283-297.

Of course, I want to do this in R, and a couple of packages exist.  My favorites come from the researcher Gaston Sanchez, who offers an ebook and tutorials on his website gastonsanchez.com.

# load the plsdepot package and a dataset
> library(plsdepot)
Warning message:
package ‘plsdepot’ was built under R version 3.0.1
> data(vehicles)
> head(vehicles)

> names(vehicles)
 [1] "diesel"      "turbo"       "two.doors"   "hatchback"   "wheel.base"
 [6] "length"      "width"       "height"      "curb.weight" "eng.size"
[11] "horsepower"  "peak.rpm"    "price"       "symbol"      "city.mpg"
[16] "highway.mpg"

This data has 16 variables and 30 observations.  It is included in the plsdepot package.

One of the interesting things about PLS regression is you can have multiple response variables and plsdepot can accommodate that type of analysis.  In this case, I just want to analyze one Y variable and that will be price.

One of the quirks of the package is that you need the predictors and responses separated, i.e. the response variable column(s) must sit at the end of your data frame.  To do that, I simply run this elegant bit of code I found somewhere...
# put the variable price (column 13) at the end
> cars = vehicles[ ,c(1:12,14:16,13)]

Here is the code to build the model; in this case we will examine it with 3 components (latent variables) by setting comps=3.  We use plsreg1 here because we are only interested in one response, price.  If we had multiple Ys, then plsreg2 would be used.

> pls1 = plsreg1(cars[, 1:15], cars[, 16, drop = FALSE], comps = 3)

# what options are available in plsreg1?
> pls1


$x.scores   X-scores (T-components)
$x.loads    X-loadings
$y.scores   Y-scores (U-components)
$y.loads    Y-loadings
$cor.xyt    score correlations
$raw.wgs    raw weights
$mod.wgs    modified weights
$std.coefs  standard coefficients
$reg.coefs  regular coefficients
$R2         R-squared
$R2Xy       explained variance of X-y by T
$y.pred     y-predicted
$resid      residuals
$T2         Hotelling T2
$Q2         Q2 cross-validation

There is a lot in this package, and I highly recommend going through the excellent tutorials to understand more.  At any rate, let me highlight a couple of things.
#R2 for each of our components
> pls1$R2
        t1         t2         t3 
0.70474894 0.11708109 0.09872169 
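
The first component alone explains about 70% of the variance in price, and the three values above sum to roughly 0.92:

> sum(pls1$R2)
[1] 0.9205517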


> # correlations plot; notice what is highly correlated with price

> plot(pls1)

[correlations plot from plot(pls1)]
> #plot each observation predicted versus actual
> plot(cars$price, pls1$y.pred, type = "n", xlab="Original", ylab = "Predicted")
> title("Comparison of responses", cex.main = 0.9)
> abline(a = 0, b = 1, col = "gray85", lwd = 2)
> text(cars$price, pls1$y.pred, col = "#5592e3")

This is not a bad start.  We would have to continue looking at different numbers of components to identify the best model and to see whether the latent variables make sense from a practical standpoint.
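
One natural next step is to inspect the cross-validated Q2 values stored in the model object (listed above); a common rule of thumb, stated here as an assumption rather than package doctrine, keeps a component when its Q2 exceeds about 0.0975:

> pls1$Q2  #cross-validated Q2 for each component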