
Tuesday, June 25, 2013

Getting started with R


I wanted to avoid advanced topics in this post and focus on some “blocking and tackling” with R in an effort to get novices started.  This is some of the basic code I found useful when I began using R just over 6 weeks ago.
Reading in data from a .csv file is a breeze with this command.
> data = read.csv(file.choose())
No need to have your own data set, as R ships with built-in datasets.
> data()  #list the datasets available in R
> # load the dataset 'cars' and display the variables
> data(cars)
> head(cars)
  speed   dist
1     4      2
2     4     10
3     7      4
4     7     22
5     8     16
6     9     10
#the command head() shows that we have two variables, car speed and stopping distance, along with the first 6 rows of data
#attach() adds the data frame to the search path so columns can be referenced by name, avoiding what I feel is the pesky $
> attach(cars)
# descriptive statistics of our two variables
> summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  
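One thing summary() leaves out is the spread of the data. As a quick sketch, sapply() applies a function to every column of the data frame:

```r
# summary() omits the standard deviation; sapply() fills the gap
sapply(cars, sd)   # standard deviation of speed and dist
sapply(cars, var)  # variance of each column
```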

> # univariate plots for speed

> plot(speed)



> hist(speed)
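With univariate plots done, a bivariate view is the natural next step. A minimal sketch using the same cars data, where the red line is an ordinary least-squares fit:

```r
# scatterplot of stopping distance against speed, with a fitted line
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "cars dataset")
fit <- lm(dist ~ speed, data = cars)  # simple linear regression
abline(fit, col = "red", lwd = 2)     # overlay the fitted line
coef(fit)                             # intercept and slope
```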

Tuesday, June 11, 2013

Visual Exploration of Time Series

A couple of weeks ago, I stumbled across the following post on using R to discover patterns in time series.
http://dahtah.wordpress.com/2013/05/17/finding-patterns-in-time-series-using-regular-expressions/#comments

The author examined a univariate time series of Australian GDP, looking for recessions, defined as two consecutive quarters of declining GDP.  The technique can also be used to examine multiple time series, looking for correlations etc.  One could use this on the chicken versus egg data set I blogged about last week!  Having looked at commodities of late, I applied it to a data set of daily prices since January 1st for the exchange traded funds tracking gold (GLD) and oil (USO).  Why gold and oil?  Well, why not!  Enjoy.

> data = read.csv(file.choose())
> attach(data)
> head(data)
      Date    GLD   USO   UGA  CORN
1 1/2/2013 163.17 33.82 59.08 43.66
2 1/3/2013 161.20 33.74 58.87 43.44
3 1/4/2013 160.44 33.88 58.40 42.77
4 1/7/2013 159.43 33.92 58.94 42.92
5 1/8/2013 160.56 33.96 59.02 43.20
6 1/9/2013 160.49 33.88 58.63 43.56
> # data set includes gas prices (UGA) and corn prices (CORN) but let's ignore them

> par(mfrow=c(2,1)) #make a 2x1 plot
> delta = (sign(diff(GLD)) == 1) + 0 #getting started with GLD
> head(delta)
[1] 0 0 0 1 0 1
> #delta is 1 on days GLD rose and 0 otherwise
> ds1 = do.call(paste0, as.list(c(delta)))
> #the "000+" pattern flags runs of at least 3 consecutive days without a price rise
> matches1 = gregexpr("000+", ds1, perl = T)[[1]]

> matches1
[1]  1 14 25 29 38 55 61 88
attr(,"match.length")
[1] 3 4 3 5 4 3 3 7
attr(,"useBytes")
[1] TRUE
> # we have 8 runs of at least 3 consecutive days of declining prices
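The delta-to-string-to-regex pipeline is easier to see on a tiny made-up series (the numbers below are hypothetical, not market data):

```r
# toy price series: three straight declines, two gains, one decline
prices <- c(10, 9, 8, 7, 9, 10, 9)
delta  <- (sign(diff(prices)) == 1) + 0  # 1 = up day, 0 = down/flat day
ds     <- paste(delta, collapse = "")    # collapse to one string: "000110"
m      <- gregexpr("000+", ds, perl = TRUE)[[1]]
m                        # the run of declines starts at position 1
attr(m, "match.length")  # and lasts 3 days
```

Note that gregexpr() returns the match positions with the run lengths attached as the "match.length" attribute, which is exactly what the highlighting code below needs.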

>  m.length1 = attr(matches1,"match.length")
> x1 = sapply(1:length(matches1),function(ind) matches1[ind]+0:(m.length1[ind]))
> hl = function(inds) lines(time(GLD)[inds], GLD[inds], col = "red", lwd = 3)
>  plot.ts(GLD, main="Gold ETF")
>  tmp1 = sapply(x1, hl)
> # that completes GLD, now on to USO

> delta = (sign(diff(USO)) == 1) + 0
> ds1 = do.call(paste0, as.list(c(delta)))
> matches1 = gregexpr("000+", ds1, perl = T)[[1]]
> matches1
[1] 39 60 68 88
attr(,"match.length")
[1] 3 5 3 4
attr(,"useBytes")
[1] TRUE
> m.length1 = attr(matches1,"match.length")
> x1 = sapply(1:length(matches1),function(ind) matches1[ind]+0:(m.length1[ind]))
>  hl = function(inds) lines(time(USO)[inds], USO[inds], col = "red", lwd = 3)
> plot.ts(USO, main = "Oil ETF")
>  tmp1 = sapply(x1, hl)



This produces the following two-panel graph of GLD and USO with the runs of declining prices highlighted in red.  Not much insight here, but there are exciting possibilities.
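The GLD and USO passages repeat the same five steps, so they could be folded into one helper.  A sketch (the function name highlight_runs and its min.run argument are my own invention, not from the original post):

```r
# hypothetical helper: plot a price series and highlight every run of
# at least 'min.run' consecutive non-rising days in red
highlight_runs <- function(x, min.run = 3, main = "") {
  delta <- (sign(diff(x)) == 1) + 0    # 1 = up day, 0 = down/flat day
  ds    <- paste(delta, collapse = "")
  pat   <- sprintf("0{%d,}", min.run)  # e.g. "0{3,}", same idea as "000+"
  m     <- gregexpr(pat, ds, perl = TRUE)[[1]]
  plot.ts(x, main = main)
  if (m[1] != -1) {                    # gregexpr returns -1 when no runs exist
    len <- attr(m, "match.length")
    for (i in seq_along(m)) {
      inds <- m[i] + 0:len[i]          # delta i spans prices i..i+1
      lines(time(as.ts(x))[inds], x[inds], col = "red", lwd = 3)
    }
  }
  invisible(m)                         # return the match positions, invisibly
}

# par(mfrow = c(2, 1))
# highlight_runs(GLD, main = "Gold ETF")
# highlight_runs(USO, main = "Oil ETF")
```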