
Wednesday, November 2, 2016

Make Text Mining Great Again

"As for Mrs. Clinton,  for all she's done for us and after all she's suffered on our behalf, she feels she's owed the presidency and who could possibly disagree?  Her life is meaningless if she doesn't get at least a shot and one can only sympathize.  Unless you think, as I do, that people should be distrusted who are running for therapeutic reasons."

"Donald Trump – a ludicrous figure, but at least he’s lived it up a bit in the real world and at least he’s worked out how to cover 90 per cent of his skull with 30 per cent of his hair. "

Christopher Hitchens


Photo: David Levinson, Getty Images

I miss "Hitch".  His prescient warnings many years ago are potentially about to come home to roost, with the ultimate grifter, HRC for short, in the White House.   One can only ponder what gems he would have written during this campaign, especially given the wonderful raw material provided by Wikileaks.

However, let's not become bogged down in Clinton the sociopath or Trump the narcissist, but instead leverage their "skills" to learn and apply some R code.  The purpose of this post is two-fold: first, to use the 'rvest' package to scrape some text; second, to conduct some text analysis using 'qdap'.  If you haven't done it before, I hope this post gives you enough of the tools and enough of the inspiration to get started on your own text-mining endeavor, keeping in mind that this work merely scratches the surface.  It should supplement, certainly not replace, the excellent 'qdap' vignette available at https://trinker.github.io/qdap/vignettes/qdap_vignette.html

The raw material we will work with is the respective candidates' acceptance speeches, which, by the way, I have not seen.  I could no more watch their speeches or the debates than I could cheer for Gopher Hockey.  Speaking of which, I expect to see you all in Las Vegas on October 27, 2018 for the North Dakota versus Minnesota U.S. Hockey Hall of Fame game. I shall attend.


Perhaps some enterprising analyst will apply the technique to the debates, or maybe they already have.  The relevant code and findings are below, and are also available on RPubs at this link.

http://rpubs.com/Custer/text

Cheers.

"Look, this is the woman who played the race card on Barack Obama, this is the woman who, if you were for change you can believe in, whichever change it was, you were voting against.  This is the woman whose foreign policy experience consists of making a fool of herself and fabricating a story about Bosnia."

Christopher Hitchens

------------------------------------------------------------------------------------------------------

1. Gather the Text

The speeches are available from http://politico.com. I used the ‘rvest’ package to identify the text and bring it into R. The only other packages needed are ‘qdap’ and ‘SnowballC’. Note that I used the Chrome Extension ‘SelectorGadget’ to identify the relevant text to scrape.
library(rvest)
library(qdap)
library(SnowballC)
If you run into an error loading ‘qdap’ then update your java version, making sure it matches R (x32 or x64).
donHTML <- read_html("http://www.politico.com/story/2016/07/full-transcript-donald-trump-nomination-acceptance-speech-at-rnc-225974")

hillHTML <- read_html("http://www.politico.com/story/2016/07/full-text-hillary-clintons-dnc-speech-226410")
# Extract the nodes containing the speech text. The "p" selector below is
# a generic placeholder; use SelectorGadget on each page to find the CSS
# selector that matches the article body.
donNode <- html_nodes(donHTML, "p")
hillNode <- html_nodes(hillHTML, "p")
SelectorGadget facilitates selecting the right html nodes; the donNode and hillNode objects created here feed the html_text() calls in the next step.

2. Prepare the Text

You can explore the text as you wish using html_text(). We will need to put the text into a dataframe, but there are some cleaning tasks that need to be done first.
donText <- html_text(donNode)
donText <- sub("Remarks as prepared for delivery according to a draft obtained by POLITICO Thursday afternoon.", '', donText)
donText <- sub("Story Continued Below", '', donText)
hillText <- html_text(hillNode)
hillText <- sub("Hillary Clinton's speech at the Democratic National Convention, as prepared for delivery:", '', hillText)
hillText <- sub("Story Continued Below", '', hillText)
If you end up with strange characters in your text, then change the character encoding with the iconv() function. The code below should do the trick.
donText <- iconv(donText, "latin1", "ASCII", "")
hillText <- iconv(hillText, "latin1", "ASCII", "")
Then this…
donText <- paste(donText, collapse = " ")
hillText <- paste(hillText, collapse = " ")
This is where the first ‘qdap’ function comes into play, qprep(). This function is a wrapper for a number of other replacement functions; using it speeds up pre-processing, but it should be used with caution if more detailed analysis is required. The functions it passes through are:
1. bracketX() - removes bracketed text
2. replace_abbreviation() - replaces abbreviations
3. replace_number() - converts numbers to words, e.g. 100 becomes one hundred
4. replace_symbol() - converts symbols to words, e.g. @ becomes ‘at’
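As a quick standalone illustration of one of these wrapped functions (an example only, not part of the pipeline):
replace_number("The crowd was 100 strong")  # 100 becomes "one hundred"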
This chunk of code does the above and also replaces contractions, removes the top 100 stopwords, and strips the text of unwanted characters. Note that we keep periods and question marks to assist in sentence splitting later.
donPrep <- qprep(donText)
hillPrep <- qprep(hillText)

donPrep <- replace_contraction(donPrep)
hillPrep <- replace_contraction(hillPrep)

donRm <- rm_stopwords(donPrep, Top100Words, separate = F)
hillRm <- rm_stopwords(hillPrep, Top100Words, separate = F)

donStrip <- strip(donRm, char.keep = c("?", "."))
hillStrip <- strip(hillRm, char.keep = c("?", "."))
One of the things you can (and should) do is fill the spaces within multiword phrases, such as a person’s name, so they are kept together in the analysis. The ‘keep’ list below provides an example and is passed to the space_fill() function; you could include several others.
It is also now time to put both speeches into one dataframe, consisting of the text for each respective candidate.
keep <- c("United States", "Hillary Clinton", "Donald Trump", "middle class", "Supreme Court")
donFill <- data.frame(space_fill(donStrip, keep))
donFill$candidate <- "Trump"
colnames(donFill)[1] <- "text"
hillFill <- data.frame(space_fill(hillStrip, keep))
hillFill$candidate <- "Clinton"
colnames(hillFill)[1] <- "text"
df1 <- rbind(donFill, hillFill)
Critical to any analysis with the ‘qdap’ package is to put the text into sentences with the sentSplit() function. It also creates the ‘tot’ variable or ‘turn of talk’ index, which is something that would be important for analyzing the debates.
df2 <- sentSplit(df1, "text")
## Warning in sentSplit(df1, "text"): The following problems were detected:
## non character, missing ending punctuation, indicating incomplete
## 
## *Consider running `check_text`
str(df2)
## Classes 'sent_split', 'qdap_df', 'sent_split_text_var:text' and 'data.frame':    660 obs. of  3 variables:
##  $ candidate: chr  "Trump" "Trump" "Trump" "Trump" ...
##  $ tot      : chr  "1.1" "1.2" "1.3" "1.4" ...
##  $ text     : chr  "friends delegates fellow americans humbly gratefully accept nomination presidency United~~States." "together lead our party back white house lead our country back safety prosperity peace." "country generosity warmth." "also country law order." ...
##  - attr(*, "text.var")= chr "text"
##  - attr(*, "qdap_df_text.var")= chr "text"
We’ve come to the point, I think, where stemming would be implemented; that is, reducing a word to its root, e.g. stems, stemming, stemmer all become stem. However, I’m not necessarily a big fan of it anymore and believe it should be applied judiciously. A number of highly experienced text miners have helped me correct the error of my former auto-stemming ways. Also, ‘qdap’ has some flexibility in comparing stemmed text versus non-stemmed text, as we shall soon see.
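Since ‘SnowballC’ was loaded at the start, here is a minimal standalone illustration of what stemming does (a demo only, never applied to our data frame):
# Porter stemming via SnowballC; each inflected form reduces to its root
wordStem(c("walking", "walked", "walks"))
## [1] "walk" "walk" "walk"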

3. Preliminary Analysis

I’ll start out with the standard word frequency analysis. As is usually the case with ‘qdap’, there are a number of options to accomplish a task. On your own, have a look at the bag_o_words() and word_count() functions. Here I create a data frame of the 25 most frequent terms by candidate and compare that data in a plot.
freq <- freq_terms(df2$text, top = 25)
plot(freq)

donFreq <- df2[df2$candidate == "Trump", ]
donFreq <- freq_terms(donFreq$text, top = 25)
hillFreq <- df2[df2$candidate == "Clinton", ]
hillFreq <- freq_terms(hillFreq$text, top = 25)
# par(mfrow=c(1,2))
plot(donFreq)

plot(hillFreq)

No surprise that Trump hits “trade”, “violence”, “immigration” and “law”. Hillary likes to talk about “us” and “me” (real shock there). Nothing about children or families?
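I mentioned word_count() above; as a minimal sketch (my own quick check, not part of the original analysis), you can use it to total the raw word volume per candidate:
wc <- word_count(df2$text)  # word count for each sentence row
tapply(wc, df2$candidate, sum, na.rm = TRUE)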
You can create a word frequency matrix, which provides the counts for each word by speaker.
wordMat <- wfm(df2$text, df2$candidate)
wordMat[c(1:5, 350:354), ]
##           Clinton Trump
## abandon         0     1
## abandoned       1     1
## able            2     2
## abolish         0     1
## abroad          1     2
## crosser         0     1
## crossings       0     1
## crucial         1     0
## crushed         0     1
## crying          0     1
Of course, we need to include the obligatory word cloud. In this case, I will use stemmed words.
trans_cloud(df2$text, df2$candidate, stem = T, min.freq = 10)



There you have it: children and families now appear. Quite a heavy burden, being engaged in what former Assistant Director of the FBI James Kallstrom characterized as a criminal foundation AND caring for families and children. Now that is leadership!
But I digress. A great function is word_associate(), which can also build word clouds based on that association. Let’s give “terror” a try.
word_associate(df2$text, df2$candidate, match.string = "terror", wordcloud = T)



##    row   group unit text                                                                                                                        
## 1    6   Trump    6 attacks our police terrorism our cities threaten our very life.                                                             
## 2   82   Trump   82 plan begin safety home means safe neighborhoods secure borders protection terrorism.                                        
## 3  114   Trump  114 task our new administration liberate our citizens crime terrorism lawlessness threatens communities.                        
## 4  133   Trump  133 once again france victim brutal islamic terrorism.                                                                          
## 5  139   Trump  139 only weeks ago orlando florida forty nine wonderful americans savagely murdered islamic terrorist.                          
## 6  140   Trump  140 terrorist targeted our lgbt community.                                                                                      
## 7  142   Trump  142 protect us terrorism need focus three things.                                                                               
## 8  145   Trump  145 instead must work our allies share our goal destroying isis stamping islamic terror.                                        
## 9  147   Trump  147 lastly must immediately suspend immigration any nation compromised terrorism until such proven vetting mechanisms put place.
## 10 328 Clinton  328 work americans our allies fight terrorism.                                                                                  
## 11 596 Clinton  596 should working responsible gun owners pass common sense reforms keep guns hands criminals terrorists others us harm.
## 
## Match Terms
## ===========
## 
## List 1:
## terrorism, terrorist, terror, terrorists
## 
No commentary needed as “res ipsa loquitur”.
Comprehensive word statistics are available. Here is a plot of the stats available in the package. The plot loses some of its visual appeal with just two speakers, but it should stimulate your interest nonetheless. A complete explanation of the stats is available under ?word_stats.
ws <- word_stats(df2$text, df2$candidate, rm.incomplete = T)
## Warning in end_inc(dataframe = DF, text.var = text.var, ...): 17 incomplete sentence items removed
plot(ws, label = T, lab.digits = 2)
## Warning: attributes are not identical across measure variables; they will
## be dropped


The breakdown in the count of sentences and words is interesting: Hillary used roughly 120 more sentences, but only about 290 more words (the exact counts appear in the readability output below), which implies much shorter sentences. I’m curious as to what questions they asked and how they incorporated them.
x1 <- question_type(df2$text, grouping.var = df2$candidate)
x1
##   candidate tot.quest     where      does       huh    unknown
## 1   Clinton        18         0  1(5.56%) 2(11.11%) 15(83.33%)
## 2     Trump         7 3(42.86%) 1(14.29%)         0  3(42.86%)
truncdf(x1$raw)
##    candidate   raw.text n.row endmark strip.text  q.type
## 1      Trump Our econom    38       ?  our econo unknown
## 2      Trump  Yet show?    46       ?  yet show  unknown
## 3      Trump After four    64       ?  after fou unknown
## 4      Trump Every acti   131       ?  every act    does
## 5      Trump Where sanc   161       ?  where san   where
## 6      Trump Where sanc   162       ?  where san   where
## 7      Trump Where sanc   163       ?  where san   where
## 8    Clinton Stay true    313       ?  stay true unknown
## 9    Clinton    Really?   349       ?    really  unknown
## 10   Clinton Alone fix?   350       ?  alone fix unknown
## 11   Clinton Forgetting   351       ?  forgettin unknown
## 12   Clinton Know commu   365       ?  know comm unknown
## 13   Clinton Lot looked   369       ?  lot looke unknown
## 14   Clinton  Big idea?   425       ?  big idea  unknown
## 15   Clinton Idea real?   427       ?  idea real unknown
## 16   Clinton      Know?   472       ?      know  unknown
## 17   Clinton          ?   473       ?                huh
## 18   Clinton          ?   474       ?                huh
## 19   Clinton Going done   534       ?  going don unknown
## 20   Clinton Going brea   535       ?  going bre unknown
## 21   Clinton Sales pitc   544       ?  sales pit unknown
## 22   Clinton Put faith    545       ?  put faith unknown
## 23   Clinton Ask yourse   579       ?  ask yours    does
## 24   Clinton Ask just s   598       ?  ask just  unknown
## 25   Clinton  Offering?   625       ?  offering  unknown
OK, we’ve learned that rows 473 and 474 should be thrown out. Also, it looks like we have the classic use of anaphora by Trump, which is the technique of repeating the first word or words of several consecutive sentences. I think Churchill used it quite a bit, e.g. “We shall not flag or fail. We shall go on to the end. We shall fight in France, we shall…”
df2[c(161:163), 3] 
## [1] "where sanctuary kate steinle?"                                 
## [2] "where sanctuary children mary ann sabine jamiel?"              
## [3] "where sanctuary americans brutally murdered suffered horribly?"
df2[c(473:474), 3]
## [1] "?" "?"
df2 <- df2[c(-473,-474), ]

4. Advanced Analysis

This is where it gets fun with ‘qdap’. You can tag the text by parts of speech. Check out ?pos and have a look at the vignette for further explanation https://trinker.github.io/qdap/vignettes/qdap_vignette.html
Be advised that this takes some time, which you can track with a progress bar. Notice Clinton’s use and Trump’s lack of use of interjections.
posbydf <- pos_by(df2$text, grouping.var = df2$candidate)
names(posbydf)
##  [1] "text"         "POStagged"    "POSprop"      "POSfreq"     
##  [5] "POSrnp"       "percent"      "zero.replace" "pos.by.freq" 
##  [9] "pos.by.prop"  "pos.by.rnp"
plot(posbydf, values = T, digits = 2)
Readability scores (measures of speech complexity) are available. I won’t go into the details as I discuss this in my book and detailed information is in the ‘qdap’ vignette.
automated_readability_index(df2$text, df2$candidate)
##   candidate word.count sentence.count character.count Automated_Readability_Index
## 1   Clinton       2636            391           15155                       9.020
## 2     Trump       2349            267           14616                      12.276
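If you don’t have the book handy, the standard ARI formula is enough to reproduce these numbers:
# ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43
4.71 * (15155/2636) + 0.5 * (2636/391) - 21.43  # Clinton: ~9.02
4.71 * (14616/2349) + 0.5 * (2349/267) - 21.43  # Trump: ~12.28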
Diversity stats are a measure of language “richness”, or rather, how expansive a speaker’s vocabulary is. The results indicate similar use of vocabulary, certainly not unusual given the assistance of professional speech writers.
diversity(df2$text, df2$candidate)
##   candidate   wc simpson shannon collision berger_parker brillouin
## 1   Clinton 2636   0.997   6.609     5.842         0.028     6.060
## 2     Trump 2349   0.997   6.613     5.708         0.040     6.032
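For intuition, the shannon column is the familiar entropy over word frequencies. Here is a rough sanity check with a naive tokenizer (my own sketch, so expect values close to, not identical with, qdap’s):
shannon <- function(txt) {
  p <- table(unlist(strsplit(tolower(txt), "[^a-z]+")))  # crude tokenization
  p <- p[names(p) != ""]
  p <- p / sum(p)       # word proportions
  -sum(p * log(p))      # H = -sum(p_i * log(p_i))
}
tapply(df2$text, df2$candidate, shannon)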
Formality contextualizes the text by comparing formal parts of speech (noun, adjective, preposition and article) versus contextual parts of speech (pronoun, verb, adverb, interjection). A plot for analysis is available. Scores closer to 100 are more formal and those closer to 0 are more contextual.
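As I understand it (this is Heylighen and Dewaele’s F-measure, stated from their paper rather than the qdap documentation), the score is computed from part-of-speech percentages:
F = (noun% + adjective% + preposition% + article% - pronoun% - verb% - adverb% - interjection% + 100) / 2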
form <- formality(df2$text, df2$candidate)
form
##   candidate word.count formality
## 1     Trump       2363     66.55
## 2   Clinton       2651     60.68
plot(form)

Polarity measures sentence sentiment. A plot is available. What we see is that, on average, Trump was slightly more negative.
pol <- polarity(df2$text, df2$candidate)
plot(pol)

The lexical dispersion plot allows one to see how a word occurs throughout the text. It is an interesting way to see how topics shift over the course of a speech. Note that you can also include freq_terms should you so choose, as sketched below.
dispersion_plot(df2$text, c("immigration", "jobs", "trade", "children"), df2$candidate)
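Here is a sketch of including freq_terms, as mentioned above; I believe the freq_terms object stores its terms in a WORD column, but check str() on your own output:
ft <- freq_terms(df2$text, top = 5)
dispersion_plot(df2$text, ft$WORD, df2$candidate)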

Finally, an example of a gradient wordcloud, which produces one wordcloud colored by a binary grouping variable. Let’s do one with words not stemmed and one with stemming included.
gradient_cloud(df2$text, df2$candidate, min.freq = 12, stem = F)

gradient_cloud(df2$text, df2$candidate, min.freq = 15, stem = T)
There you have it. Now go find text data, manipulate text data, analyze text data and make text-mining great again.

Tuesday, June 14, 2016

Iraq-Wikileaks Analysis with R

“In a place of extreme violence and devoid of order, the practical subsumes the principle. I drifted down the path of bribery and corruption endemic to the streets of Baghdad.”

Jason Whiteley, Father of Money: Buying Peace in Baghdad




As I mentioned in a previous post, I wanted to explore the Wikileaks data of the US Military's reported Significant Activities (SIGACTS).  The data are a subset of the famous classified US military documents that Private Bradley Manning provided to Wikileaks; Manning is now behind bars, having received a 35-year sentence in 2013.  The subset of these documents I will use is available on The Guardian’s datablog website at this link:


The Guardian created this subset by selecting only those SIGACT reports that were associated with deaths of personnel and that they felt did not compromise confidential sources.  It is stored in a Google Fusion Table.

The code provided merely scratches the surface of the analysis one can do with this data set of roughly 52,000 SIGACTs.  What I show is how to pull the data into R, conduct some basic data wrangling, create a subset, perform a cluster analysis and, finally, build maps.  In creating the maps, I show how to build a static map with the ggplot2 package as well as an interactive map with the leaflet package.
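To give a flavor of the interactive piece, here is a minimal leaflet sketch using made-up coordinates near Baghdad (purely illustrative; the real post works from the Guardian subset):
library(leaflet)
# hypothetical SIGACT locations, for illustration only
sigacts <- data.frame(lat = c(33.31, 33.35, 33.28), lng = c(44.36, 44.42, 44.39))
leaflet(sigacts) %>% addTiles() %>% addCircleMarkers(lng = ~lng, lat = ~lat)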

The subset of the data will focus on 2009 and the area assigned to Multi-National Division Baghdad, since I spent 10 months of that year there, roughly 99% of the time in that Division’s Area of Responsibility.

The analysis with code and commentary is on RPubs.com at the following link: