Sunday, February 17, 2019

The 3rd Edition of Mastering Machine Learning with R

Well, it is here. I have to say this is a great book. The other two editions had their charm, but this edition covers it all.

Friday, April 20, 2018

Data You May Find Interesting

Photo of the 1912 Reunion in Winchester, IN of the 19th Indiana Infantry Regiment, The Iron Brigade, 1st Division (Wadsworth), 1st Corps (Reynolds), Army of the Potomac (Meade)

If you are tired of 'mtcars' and 'iris' etc., and want something to hone your data munging skills, then have a look at this data on The Battle of Gettysburg on my GitHub.  I have compiled the information from the book, The Gettysburg Campaign in Numbers and Losses, by J. David Petruzzi and Steven A. Stanley.  It is still a work in progress as I am adding more information to the file.  For instance, I am in the process of adding the names of the Regimental Commanders and their casualty status.

Several things to keep in mind:
1.  This file is loaded with missing data for you to manage, such as artillery gun types.  Some units on the Confederate side also have no known information on their casualties, including a couple of regiments involved in heavy fighting.  I'm interested in seeing how people will handle those regiments!
2.  The data is as of the end of the fighting on July 3rd.  So, you will see tons of "missing" casualties who were in fact dead, wounded, or captured.
3.  Confederate sources are notorious for their inaccuracy.  The authors did an amazing piece of scholarship to close the gaps.

I think history buffs will enjoy it.  I also think that even if you are not in the slightest way interested in arguably the most important battle in American history, you should still give it a try to perfect your munging and visualization skills.  I look forward to seeing what you can do with this.

Let me know if you have questions or comments.


# a little code

# loading my working copy
library(dplyr)

gettysburg_oob <- readr::read_csv("gettysburg_oob.csv")


# setting all NA in the numeric columns to zero
gettysburg_oob <- gettysburg_oob %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))

# this will produce the percent of infantry casualties per each of the 10 Corps
# of course I use dplyr
gettysburg_oob %>%
  filter(type == 'Infantry') %>%
  select(corps, men, total_casualties) %>%
  group_by(corps) %>%
  summarize(men = sum(men), casualties = sum(total_casualties)) %>%
  mutate(percent_casualties = casualties / men)

# your turn...
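
Zero-filling is the quick route, but it treats unknown casualties as no casualties. Here is a minimal sketch of one alternative, assuming the same column names as above (corps, men, total_casualties). The data frame toy_oob and its values are invented for illustration and do not come from the real file. It keeps the NAs, computes the rate only from units with known casualties, and reports how many units per corps are missing:

```r
# toy data: values are invented for illustration, not taken from the book
toy_oob <- data.frame(
  corps = c("I", "I", "II", "II", "II"),
  men = c(300, 450, 500, 250, 400),
  total_casualties = c(150, NA, 100, 50, NA)
)

# a factor keeps every corps in the summary, even one with no known casualties
corps_f <- factor(toy_oob$corps)

# flag units whose casualties are known
known <- !is.na(toy_oob$total_casualties)

# totals per corps using only units with known casualties
men_known <- tapply(toy_oob$men[known], corps_f[known], sum)
cas_known <- tapply(toy_oob$total_casualties[known], corps_f[known], sum)

corps_summary <- data.frame(
  corps = levels(corps_f),
  units_missing = as.vector(tapply(!known, corps_f, sum)),
  percent_casualties = as.vector(cas_known / men_known)
)
corps_summary
```

Whether to drop, impute, or zero-fill those units is a judgment call; the point of the sketch is only that the number of missing units stays visible next to the rate.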

Sunday, October 1, 2017

Simulation on Statistical Significance and Power

"Are the effects of A and B different? They are always different---for some decimal place." Tukey, 1991, quoted in (Cohen 1994)

"The continued very extensive use of significance tests is alarming." (Cox 1986)

"Small wonder that students have trouble [with statistical hypothesis testing]. They may be trying to think." (Deming 1975)

I put this together this morning as an aid for understanding significance and power via simulation and visualization.  This first iteration is around a two-sample t-test.

P-values CAN be useful, but use with caution.

I would appreciate any additional ideas on how to improve the code and what it is trying to convey.
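
To give a flavor of the approach, here is a minimal sketch that is separate from the RPubs code. The sample size, effect size, alpha, number of replications, and seed are all assumptions chosen for illustration. It simulates many two-sample t-tests, estimates power as the share of p-values below alpha, and compares that with the analytic value from power.t.test():

```r
set.seed(123)  # arbitrary seed so the run is repeatable

# assumed settings: n per group, true difference in means, significance level
n     <- 30
delta <- 0.5
alpha <- 0.05

# simulate many two-sample t-tests and collect the p-values
sim_p <- replicate(2000, {
  a <- rnorm(n, mean = 0, sd = 1)
  b <- rnorm(n, mean = delta, sd = 1)
  t.test(a, b)$p.value
})

# estimated power: share of simulations that reject at alpha
sim_power <- mean(sim_p < alpha)

# analytic power for comparison (assumes equal variances)
ref_power <- power.t.test(n = n, delta = delta, sd = 1, sig.level = alpha)$power

c(simulated = sim_power, analytic = ref_power)

# distribution of p-values under a true difference
hist(sim_p, breaks = 20, main = "P-values when the true difference is 0.5")
```

Setting delta to 0 and rerunning should flatten the histogram toward uniform, which is a handy way to see what a p-value does and does not tell you.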

The full code is on rpubs:

Sunday, July 23, 2017

Exploration of Text-Mining Packages Using HST's 'Hey Rube' Columns

Well, time for another post on text-mining.  I figured it was time to see how the quanteda and tidytext packages could supplement what I've been doing with tm and qdap. Here is the link to the Markdown document on RPubs.

Saturday, May 13, 2017

Mastering Machine Learning with R, 2nd Edition

The second edition to my book on machine learning with R is now available on

In this edition, I've added new data sets, and methods such as XGBoost, Sequential Analysis, and Multivariate Adaptive Regression Splines, which is quickly becoming my favorite technique for a number of reasons I discuss in the book.


Monday, March 13, 2017

Plotting Vietnam Airstrikes with Leaflet and R

"No event in American history is more misunderstood than the Vietnam War. It was misreported then, and it is mis-remembered now." Richard M. Nixon

"The war in Vietnam was not lost in the field, nor was it lost on the front pages of the New York Times or the college campuses. It was lost in Washington, D.C." H.R. McMaster


On 17 February, I attended the very moving and emotional Memorial Service at the National Infantry Museum for LTG(R) Harold G. Moore, the co-author of "We Were Soldiers Once and Young", which chronicles their experience of fighting North Vietnamese Regulars on Landing Zone (LZ) Xray.  The book has become almost required reading for any Army Officer and one cannot overstate how important the book has become to professional development.  The movie is fine, but doesn't do the real story justice, even though Mel Gibson portrayed the intrepid Hal Moore.  An overview of the desperate struggle is available here:

Several days later, I was informed of the data on airstrikes available from World War 1 through the Vietnam War on  The data is also available on  It is a treasure trove of data and was made available to support open-source analysis.

Since airpower played an integral part in saving the 1st Bn, 7th Cavalry from being overrun, I decided to explore those strikes using the Leaflet package in R. I created a subset of the data to include only those days 1/7 CAV was on the ground, 14 - 16 November, 1965.  The full code and interactive map are on RPubs:

Here is the full code as well:


library(dplyr)
library(leaflet)

df = read.csv("vietnam_nov_65.csv")


# If the same mission conducts multiple attacks on the same
# lat/long, it generates a separate observation. Therefore, I have
# chosen to dedupe on mission id number (MISSIONID)
df <- distinct(df, MISSIONID, .keep_all = T)

# US Air Force = blue
# US Marine Corps = dark red
# US Navy = light green
df$color <- ifelse(df$MILSERVICE == "USAF", "blue", ifelse(
  df$MILSERVICE == "USMC", "darkred", "lightgreen"
))

icons <- awesomeIcons(
  icon = "air",
  iconColor = "black",
  library = "glyphicon",
  markerColor = df$color # Branch of Service
  #text = df$MISSIONID
)

# df$popup (the label/popup text) is assumed to exist already;
# the code that builds it is not shown in this listing
leaflet(df) %>%
  addProviderTiles("Esri.WorldImagery", group = "Image") %>%
  addProviderTiles("Stamen.Terrain", group = "Terrain") %>%
  addLayersControl(
    baseGroups = c("Image", "Terrain"),
    overlayGroups = df$MSNDATE,
    options = layersControlOptions(collapsed = FALSE))  %>%

  addAwesomeMarkers(icon = icons,
             label = ~ df$popup,
             popup = ~ df$popup,
             group = paste(df$MSNDATE))
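
The dedupe and color steps can be tried without the CSV or any packages. This is a minimal base R sketch, assuming only the two column names used above (MISSIONID and MILSERVICE). The toy rows and the service_colors lookup are invented for illustration. A named vector stands in for the nested ifelse(), which tends to be easier to extend if more services turn up:

```r
# toy stand-in for the airstrike data; values are invented
toy <- data.frame(
  MISSIONID  = c(101, 101, 102, 103),
  MILSERVICE = c("USAF", "USAF", "USMC", "USN"),
  stringsAsFactors = FALSE
)

# base R counterpart of distinct(df, MISSIONID, .keep_all = TRUE):
# keep the first row seen for each mission id
toy <- toy[!duplicated(toy$MISSIONID), ]

# lookup table instead of nested ifelse()
service_colors <- c(USAF = "blue", USMC = "darkred")
toy$color <- unname(service_colors[toy$MILSERVICE])

# anything not in the lookup falls back to light green, as in the original
toy$color[is.na(toy$color)] <- "lightgreen"
toy
```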

Wednesday, November 2, 2016

Make Text Mining Great Again

"As for Mrs. Clinton,  for all she's done for us and after all she's suffered on our behalf, she feels she's owed the presidency and who could possibly disagree?  Her life is meaningless if she doesn't get at least a shot and one can only sympathize.  Unless you think, as I do, that people should be distrusted who are running for therapeutic reasons."

"Donald Trump – a ludicrous figure, but at least he’s lived it up a bit in the real world and at least he’s worked out how to cover 90 per cent of his skull with 30 per cent of his hair. "

Christopher Hitchens

David Levinson, Getty Images

I miss "Hitch".  His prescient warnings many years ago are potentially about to come home to roost, with the ultimate grifter, HRC for short, in the White House.   One can only ponder what gems he would have written during this campaign, especially given the wonderful raw material provided by Wikileaks.

However, let's not become bogged down in Clinton the sociopath or Trump the narcissist, but instead leverage their "skills" to learn and apply some R code.  The purpose of this post is two-fold: first, to use the 'rvest' package to scrape some text and second, to conduct some text analysis using 'qdap'.  If you haven't done it before, I hope this post gives you enough of the tools and enough of the inspiration to get started on your own text-mining endeavor, keeping in mind that this work merely scratches the surface.  It should supplement, certainly not replace, the excellent 'qdap' vignette available -

The raw material we will work with is the candidates' respective acceptance speeches, which I have not seen by the way.  I could no more watch their speeches or the debates than I could cheer for Gopher Hockey.  By the way, I expect to see you all in Las Vegas October 27, 2018 for the North Dakota versus Minnesota U.S. Hockey Hall of Fame game. I shall attend.

Perhaps some enterprising analyst will apply the technique to the debates or maybe they already have.  The relevant code and findings are below, but are also available on RPubs at this link.


"Look, this is the woman who played the race card on Barack Obama, this is the woman who, if you were for change you can believe in, whichever change it was, you were voting against.  This is the woman whose foreign policy experience consists of making a fool of herself and fabricating a story about Bosnia."

Christopher Hitchens


1. Gather the Text

The speeches are available from I used the ‘rvest’ package to identify the text and bring it into R. The only other package needed is ‘qdap’. Note that I used the Chrome Extension ‘SelectorGadget’ to identify the CSS selectors for the relevant text.
If you run into an error loading ‘qdap’ then update your Java version, making sure it matches your R installation (32-bit or 64-bit).
donHTML <- read_html("")

hillHTML <- read_html("")
SelectorGadget facilitates selecting the right HTML nodes. The selected nodes end up in donNode and hillNode, which the next step reads.

2. Prepare the Text

You can explore the text as you wish using html_text(). We will need to put the text into a dataframe, but there are some cleaning tasks that need to be done first.
donText <- html_text(donNode)
donText <- sub("Remarks as prepared for delivery according to a draft obtained by POLITICO Thursday afternoon.", '', donText)
donText <- sub("Story Continued Below", '', donText)
hillText <- html_text(hillNode)
hillText <- sub("Hillary Clinton's speech at the Democratic National Convention, as prepared for delivery:", '', hillText)
hillText <- sub("Story Continued Below", '', hillText)
If you end up with strange characters in your text then change the character encoding using the iconv() function. The code below should do the trick.
donText <- iconv(donText, "latin1", "ASCII", "")
hillText <- iconv(hillText, "latin1", "ASCII", "")
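
To see what that call does, here is a minimal sketch on a single made-up string rather than the speeches. The string is built from raw bytes so the example does not depend on how your editor saves the file. With sub = "", any byte that cannot be represented in ASCII is dropped rather than turned into NA:

```r
# "café" encoded as latin1 bytes; the last byte (0xe9) is not ASCII
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# sub = "" drops any byte that cannot be converted to ASCII
clean <- iconv(x, from = "latin1", to = "ASCII", sub = "")
clean
```

The trade-off is that accented letters vanish instead of being transliterated, which is usually acceptable for word counts but worth knowing if names matter to your analysis.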
Then collapse each speech into a single string. Note that collapse takes a single separator.
donText <- paste(donText, collapse = " ")
hillText <- paste(hillText, collapse = " ")
This is where the first ‘qdap’ function comes into play, qprep(). This function is a wrapper for a number of other replacement functions and using it will speed pre-processing, but should be used with caution if more detailed analysis is required. The functions it passes through are as follows:
1.  bracketX() - apply bracket removal
2.  replace_abbreviation() - changes abbreviations
3.  replace_number() - numbers to words e.g. 100 becomes one hundred
4.  replace_symbol() - symbols become words e.g. @ becomes ‘at’
This chunk of code does the above and also replaces contractions, removes the top 100 stopwords and strips the text of unwanted characters. Note that we will keep the period and the question marks to assist in sentence creation.
donPrep <- qprep(donText)
hillPrep <- qprep(hillText)

donPrep <- replace_contraction(donPrep)
hillPrep <- replace_contraction(hillPrep)

donRm <- rm_stopwords(donPrep, Top100Words, separate = F)
hillRm <- rm_stopwords(hillPrep, Top100Words, separate = F)

donStrip <- strip(donRm, char.keep = c("?", "."))
hillStrip <- strip(hillRm, char.keep = c("?", "."))
One of the things you can/should do is fill the spaces between words that belong together, such as a person’s name, so that they are treated as a single term in the analysis. The ‘keep’ list below provides an example of this and it will be used in the space_fill() function. You could include several others.
It is also now time to put both speeches into one dataframe, consisting of the text for each respective candidate.
keep <- c("United States", "Hillary Clinton", "Donald Trump", "middle class", "Supreme Court")
donFill <- data.frame(space_fill(donStrip, keep))
donFill$candidate <- "Trump"
colnames(donFill)[1] <- "text"
hillFill <- data.frame(space_fill(hillStrip, keep))
hillFill$candidate <- "Clinton"
colnames(hillFill)[1] <- "text"
df1 <- rbind(donFill, hillFill)
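The idea behind space filling can be sketched in a few lines of base R. The helper fill_phrases() below is hypothetical and is not part of ‘qdap’; it only illustrates the effect, using the same "~~" joiner that shows up in the sentSplit() output further down. Unlike a general-purpose tool, this sketch is case-sensitive and matches the phrases literally:

```r
# illustrative helper, not part of qdap: join multi-word phrases with "~~"
fill_phrases <- function(text, phrases, sep = "~~") {
  for (p in phrases) {
    # replace the spaces inside each phrase, then swap the phrase in the text
    filled <- gsub(" ", sep, p, fixed = TRUE)
    text <- gsub(p, filled, text, fixed = TRUE)
  }
  text
}

fill_phrases("the United States Supreme Court ruled.",
             c("United States", "Supreme Court"))
```

Once joined, "United~~States" counts as one token in any later frequency table instead of two unrelated words.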
Critical to any analysis with the ‘qdap’ package is to put the text into sentences with the sentSplit() function. It also creates the ‘tot’ variable or ‘turn of talk’ index, which is something that would be important for analyzing the debates.
df2 <- sentSplit(df1, "text")
## Warning in sentSplit(df1, "text"): The following problems were detected:
## non character, missing ending punctuation, indicating incomplete
## *Consider running `check_text`
## Classes 'sent_split', 'qdap_df', 'sent_split_text_var:text' and 'data.frame':    660 obs. of  3 variables:
##  $ candidate: chr  "Trump" "Trump" "Trump" "Trump" ...
##  $ tot      : chr  "1.1" "1.2" "1.3" "1.4" ...
##  $ text     : chr  "friends delegates fellow americans humbly gratefully accept nomination presidency United~~States." "together lead our party back white house lead our country back safety prosperity peace." "country generosity warmth." "also country law order." ...
##  - attr(*, "text.var")= chr "text"
##  - attr(*, "qdap_df_text.var")= chr "text"
We’ve come to the point I think where stemming would be implemented. That is, to reduce a word to its root e.g. stems, stemming, stemmer all become stem. However, I’m not necessarily a big fan of it anymore and believe it should be applied judiciously. A number of highly experienced text miners have helped me correct the error of my former auto-stemming ways. Also, ‘qdap’ has some flexibility in comparing stemmed text versus non-stemmed text as we shall soon see.

3. Preliminary Analysis

I’ll start out with the standard word frequency analysis. As is usually the case with ‘qdap’, there are a number of options to accomplish a task. On your own have a look at the bag_o_words() and word_count() functions. Here I create a df of the 25 most frequent terms by candidate and compare that data in a plot.
freq <- freq_terms(df2$text)

donFreq <- df2[df2$candidate == "Trump", ]
donFreq <- freq_terms(donFreq$text)
hillFreq <- df2[df2$candidate == "Clinton", ]
hillFreq <- freq_terms(hillFreq$text)
# par(mfrow=c(1,2))


No surprise that Trump hits “trade”, “violence”, “immigration” and “law”. Hillary likes to talk about “us” and “me” (real shock there). Nothing about children or families?
You can create a word frequency matrix, which provides the counts for each word by speaker:
wordMat <- wfm(df2$text, df2$candidate)
wordMat[c(1:5, 350:354), ]
##           Clinton Trump
## abandon         0     1
## abandoned       1     1
## able            2     2
## abolish         0     1
## abroad          1     2
## crosser         0     1
## crossings       0     1
## crucial         1     0
## crushed         0     1
## crying          0     1
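If you want to see what a word frequency matrix is without loading ‘qdap’, here is a minimal base R sketch. The toy data frame, its sentences, and the speaker labels "A" and "B" are invented for illustration. It is not how wfm() is implemented, just the same idea built with table():

```r
# toy sentences standing in for df2; the text is invented
toy <- data.frame(
  candidate = c("A", "A", "B"),
  text = c("jobs and trade.", "jobs for families.", "trade and jobs and law."),
  stringsAsFactors = FALSE
)

# split each sentence into lowercase words, dropping punctuation
words <- strsplit(gsub("[^a-z ]", "", tolower(toy$text)), " +")

# repeat each speaker label once per word, then cross-tabulate
speaker <- rep(toy$candidate, lengths(words))
word_mat <- table(word = unlist(words), speaker = speaker)
word_mat

# most frequent terms overall
head(sort(rowSums(word_mat), decreasing = TRUE), 3)
```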
Of course we need to include the obligatory word cloud. In this case, I will use stemmed words:
trans_cloud(df2$text, df2$candidate, stem = T, min.freq = 10)

There you have it, children and families now appear. Quite a heavy burden being engaged in what former Assistant Director of the FBI, James Kallstrom, characterized as a criminal foundation AND caring for families and children. Now that is leadership!
But I digress. A great function is ‘word_associate()’ and building word clouds based on that association. Let’s give “terror” a try.
word_associate(df2$text, df2$candidate, match.string = "terror", wordcloud = T)

##    row   group unit text                                                                                                                        
## 1    6   Trump    6 attacks our police terrorism our cities threaten our very life.                                                             
## 2   82   Trump   82 plan begin safety home means safe neighborhoods secure borders protection terrorism.                                        
## 3  114   Trump  114 task our new administration liberate our citizens crime terrorism lawlessness threatens communities.                        
## 4  133   Trump  133 once again france victim brutal islamic terrorism.                                                                          
## 5  139   Trump  139 only weeks ago orlando florida forty nine wonderful americans savagely murdered islamic terrorist.                          
## 6  140   Trump  140 terrorist targeted our lgbt community.                                                                                      
## 7  142   Trump  142 protect us terrorism need focus three things.                                                                               
## 8  145   Trump  145 instead must work our allies share our goal destroying isis stamping islamic terror.                                        
## 9  147   Trump  147 lastly must immediately suspend immigration any nation compromised terrorism until such proven vetting mechanisms put place.
## 10 328 Clinton  328 work americans our allies fight terrorism.                                                                                  
## 11 596 Clinton  596 should working responsible gun owners pass common sense reforms keep guns hands criminals terrorists others us harm.
## Match Terms
## ===========
## List 1:
## terrorism, terrorist, terror, terrorists
No commentary needed as “res ipsa loquitur”.
Comprehensive word statistics are available. Here is a plot of the stats available in the package. The plot loses some of its visual appeal with just two speakers, but it should stimulate your interest nonetheless. A complete explanation of the stats is available under ?word_stats.
ws <- word_stats(df2$text, df2$candidate, rm.incomplete = T)
## Warning in end_inc(dataframe = DF, text.var = text.var, ...): 17 incomplete sentence items removed
plot(ws, label = T, lab.digits = 2)
## Warning: attributes are not identical across measure variables; they will
## be dropped

Interesting the breakdown in the count of sentences and words. Hillary used a hundred more sentences, but only two hundred more words. I’m curious as to what questions they asked and how they incorporated them.
x1 <- question_type(df2$text, grouping.var = df2$candidate)
##   candidate tot.quest     where      does       huh    unknown
## 1   Clinton        18         0  1(5.56%) 2(11.11%) 15(83.33%)
## 2     Trump         7 3(42.86%) 1(14.29%)         0  3(42.86%)
##    candidate   raw.text n.row endmark strip.text  q.type
## 1      Trump Our econom    38       ?  our econo unknown
## 2      Trump  Yet show?    46       ?  yet show  unknown
## 3      Trump After four    64       ?  after fou unknown
## 4      Trump Every acti   131       ?  every act    does
## 5      Trump Where sanc   161       ?  where san   where
## 6      Trump Where sanc   162       ?  where san   where
## 7      Trump Where sanc   163       ?  where san   where
## 8    Clinton Stay true    313       ?  stay true unknown
## 9    Clinton    Really?   349       ?    really  unknown
## 10   Clinton Alone fix?   350       ?  alone fix unknown
## 11   Clinton Forgetting   351       ?  forgettin unknown
## 12   Clinton Know commu   365       ?  know comm unknown
## 13   Clinton Lot looked   369       ?  lot looke unknown
## 14   Clinton  Big idea?   425       ?  big idea  unknown
## 15   Clinton Idea real?   427       ?  idea real unknown
## 16   Clinton      Know?   472       ?      know  unknown
## 17   Clinton          ?   473       ?                huh
## 18   Clinton          ?   474       ?                huh
## 19   Clinton Going done   534       ?  going don unknown
## 20   Clinton Going brea   535       ?  going bre unknown
## 21   Clinton Sales pitc   544       ?  sales pit unknown
## 22   Clinton Put faith    545       ?  put faith unknown
## 23   Clinton Ask yourse   579       ?  ask yours    does
## 24   Clinton Ask just s   598       ?  ask just  unknown
## 25   Clinton  Offering?   625       ?  offering  unknown
OK, we’ve learned that rows 473 and 474 should be thrown out. It also looks like we have the classic use of anaphora by Trump, which is the technique of repeating the first word or words of several consecutive sentences. I think Churchill used it quite a bit e.g. “We shall not flag or fail. We shall go on to the end. We shall fight in France, we shall…”
df2[c(161:163), 3] 
## [1] "where sanctuary kate steinle?"                                 
## [2] "where sanctuary children mary ann sabine jamiel?"              
## [3] "where sanctuary americans brutally murdered suffered horribly?"
df2[c(473:474), 3]
## [1] "?" "?"
df2 <- df2[c(-473,-474), ]
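Spotting anaphora by eye works for one speech, but it can also be flagged mechanically. This is a minimal sketch in base R, assuming a simple rule of my own choosing: two or more consecutive sentences that start with the same word. The vector sents holds four sentences copied from the output shown earlier in this post:

```r
# four sentences taken from the output shown earlier in the post
sents <- c("also country law order.",
           "where sanctuary kate steinle?",
           "where sanctuary children mary ann sabine jamiel?",
           "where sanctuary americans brutally murdered suffered horribly?")

# first word of each sentence: drop everything from the first space onward
first_word <- sub(" .*", "", sents)

# run-length encode the first words; runs of 2 or more suggest anaphora
runs <- rle(first_word)
candidates <- data.frame(word = runs$values, run_length = runs$lengths)
candidates[candidates$run_length >= 2, ]
```

On text with stopwords removed, as here, the rule is fairly selective; on raw text you would probably want to skip runs that begin with words like "the" or "we".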

4. Advanced Analysis

This is where it gets fun with ‘qdap’. You can tag the text by parts of speech. Check out ?pos and have a look at the vignette for further explanation.
Be advised that this takes some time, which you can track with a progress bar. Notice Clinton’s use and Trump’s lack of use of interjections.
posbydf <- pos_by(df2$text, grouping.var = df2$candidate)
##  [1] "text"         "POStagged"    "POSprop"      "POSfreq"     
##  [5] "POSrnp"       "percent"      "zero.replace" "" 
##  [9] ""  ""
plot(posbydf, values = T, digits = 2)
Readability scores (measures of speech complexity) are available. I won’t go into the details as I discuss this in my book and detailed information is in the ‘qdap’ vignette.
automated_readability_index(df2$text, df2$candidate)
##   candidate word.count sentence.count character.count Automated_Readability_Index
## 1   Clinton       2636            391           15155                       9.020
## 2     Trump       2349            267           14616                      12.276
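The index itself is simple arithmetic on the three counts in that table. As a quick check, the standard Automated Readability Index formula, 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43, reproduces the scores above. The data frame name ari_in is just a label for this sketch:

```r
# counts copied from the output above
ari_in <- data.frame(
  candidate  = c("Clinton", "Trump"),
  words      = c(2636, 2349),
  sentences  = c(391, 267),
  characters = c(15155, 14616)
)

# standard ARI formula: longer words and longer sentences raise the score
ari <- with(ari_in,
            4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43)
round(ari, 3)
# 9.020 for Clinton and 12.276 for Trump, matching the table
```

Trump's higher score comes from both terms: his words average more characters and his sentences average more words.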
Diversity stats are a measure of language “richness” or rather, how expansive a speaker’s vocabulary is. The results indicate similar use of vocabulary, certainly not unusual given the assistance of professional speech writers.
diversity(df2$text, df2$candidate)
##   candidate   wc simpson shannon collision berger_parker brillouin
## 1   Clinton 2636   0.997   6.609     5.842         0.028     6.060
## 2     Trump 2349   0.997   6.613     5.708         0.040     6.032
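For intuition on three of those columns, here is a minimal sketch using the textbook definitions of the Shannon, Simpson, and Berger-Parker indices on a four-word toy vocabulary. I am assuming the ‘qdap’ columns follow these forms; check ?diversity for the exact definitions it uses:

```r
words <- c("a", "a", "b", "c")  # toy vocabulary, invented for illustration
n <- as.vector(table(words))    # count per distinct word
N <- sum(n)                     # total number of words
p <- n / N                      # relative frequency of each word

shannon <- -sum(p * log(p))                       # higher = more even spread
simpson <- 1 - sum(n * (n - 1)) / (N * (N - 1))   # chance two draws differ
berger_parker <- max(n) / N                       # share of the top word

round(c(shannon = shannon, simpson = simpson, berger_parker = berger_parker), 3)
```

The Berger-Parker values in the table read the same way: the most frequent word makes up roughly 3 to 4 percent of each speech.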
Formality contextualizes the text by comparing formal parts of speech (noun, adjective, preposition and article) versus contextual parts of speech (pronoun, verb, adverb, interjection). A plot for analysis is available. Scores closer to 100 are more formal and those closer to 1 are more contextual.
form <- formality(df2$text, df2$candidate)
##   candidate word.count formality
## 1     Trump       2363     66.55
## 2   Clinton       2651     60.68

Polarity measures sentence sentiment. A plot is available. What we see is that, on average, Trump was slightly more negative.
pol <- polarity(df2$text, df2$candidate)

The lexical dispersion plot allows one to see how a word occurs throughout the text. It is an interesting way to see how topics change over time. Note that you can also include freq_terms should you so choose.
dispersion_plot(df2$text, c("immigration", "jobs", "trade", "children"), df2$candidate)

Finally, an example of a gradient wordcloud, which produces one wordcloud colored by a binary grouping variable. Let’s do one with words not stemmed and one with stemming included.
gradient_cloud(df2$text, df2$candidate, min.freq = 12, stem = F)

gradient_cloud(df2$text, df2$candidate, min.freq = 15, stem = T)
There you have it. Now go find text data, manipulate text data, analyze text data and make text-mining great again.