Load libraries

library(tm) # Framework for text mining.
library(SnowballC) # Provides wordStem() for stemming.
library(qdap) # Quantitative discourse analysis of transcripts.
library(qdapDictionaries)
library(dplyr) # Data preparation and pipes %>%.
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
library(scales) # Include commas in numbers on plot axes.
library(Rgraphviz) # Plot correlations between frequent terms.
library(wordcloud) # Generate word clouds.
library(rCharts) # Interactive JavaScript charts.
library(stringr) # String manipulation.

Handling a corpus with tm

A corpus is a collection of written texts, usually with a common theme or topic. To demonstrate text mining with the tm package, I will use a collection of short essays I wrote in my senior year at Gonzaga. These essays assess Amazon’s performance through the lens of Strategic Management. Rather than use a random collection of texts, I chose my own writing because text mining can also teach you something about your own writing style. We will see this in play when we identify frequent terms and associations later in the tutorial.

Let’s start with a few functions that give a glimpse of the types of data tm can handle.

#List data sources supported by tm
getSources()
## [1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"   
## [5] "XMLSource"       "ZipSource"

In my case, I will be using text files stored locally on my computer and will therefore use the directory source, DirSource().

#List document "readers" for text analysis
getReaders()
##  [1] "readDOC"                 "readPDF"                
##  [3] "readPlain"               "readRCV1"               
##  [5] "readRCV1asPlain"         "readReut21578XML"       
##  [7] "readReut21578XMLasPlain" "readTabular"            
##  [9] "readTagged"              "readXML"

Create directory and load corpus

I have created a folder for my text files at “./corpus/txt”. The name of this directory is stored in the cname variable as a character string. The file.path() function is handy for creating directory paths, as seen below.

#create directory name
cname <- file.path(".", "corpus", "txt")
cname
## [1] "./corpus/txt"

We can now access the corpus at this directory and view the number, as well as the names, of the text files.

#View the number of text files in the corpus
length(dir(cname))
## [1] 10
#View the names of the text files in the corpus
dir(cname)
##  [1] "Amazon Analysis– Module 3.txt" "Amazon Paper– Module 2.txt"   
##  [3] "Amazon- Module 4_Final.txt"    "Amazon-Module 5.txt"          
##  [5] "Amazon– Module 12.txt"         "Amazon– Module 9.txt"         
##  [7] "Amazon–Module 10.txt"          "Amazon–Module 6.txt"          
##  [9] "Amazon–Module 7.txt"           "Amazon–Module 8.txt"

Now we’re ready to load the corpus from the directory using DirSource(). The source object is passed to the Corpus() function, which loads the text files. We save the corpus in memory as a variable called docs.

docs <- Corpus(DirSource(cname))

#Summarize corpus
summary(docs)
##                               Length Class             Mode
## Amazon Analysis– Module 3.txt 2      PlainTextDocument list
## Amazon Paper– Module 2.txt    2      PlainTextDocument list
## Amazon- Module 4_Final.txt    2      PlainTextDocument list
## Amazon-Module 5.txt           2      PlainTextDocument list
## Amazon– Module 12.txt         2      PlainTextDocument list
## Amazon– Module 9.txt          2      PlainTextDocument list
## Amazon–Module 10.txt          2      PlainTextDocument list
## Amazon–Module 6.txt           2      PlainTextDocument list
## Amazon–Module 7.txt           2      PlainTextDocument list
## Amazon–Module 8.txt           2      PlainTextDocument list

Handling other data sources

If your data is in a format other than plain text files, tm offers readers for popular document types such as PDF and Word. Notice that the reader for each file type is specified using the readerControl parameter.

PDF Documents

# docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))

Word Documents

# docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC))

Exploring the corpus further

More information about a particular text file within the corpus can be viewed using the inspect() function. I had problems displaying the text file’s contents with inspect(), so I used writeLines() instead.

inspect(docs[4])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6669
writeLines(as.character(docs[[4]]))
## Granger Moch
## February 22nd, 2015
## Strategic Management
## Amazon- Module 5
## 
## 1.) Amazon<d5>s products and services are wide-ranging and easily surpass any competing e-retailer.  Amazon Prime, one of the more profitable services offered by the company, now goes far beyond the benefits of two-day shipping. The membership offers instant streaming of movies and TV shows, free e-books, ad-free music, and more. Also another Internet service, Amazon Web Services offers cloud computing, storage and content delivery, mobile services, and more. As previously mentioned, the company is also penetrating the fresh mood market with Amazon Fresh. Clearly, Amazon has well differentiated its services. On the product front, the company is equally diversified.  Certainly, there is the popular Kindle along with other electronics, but Amazon also offers toys, beauty products, textbooks and many other products. Amazon is somewhat disadvantaged with certain product categories they sell, namely heavier or more fragile items. Items such as heavy hardware, or televisions are expensive and precarious to ship; generally, customers would rather have pick up these items at a retailer themselves. The basis of Amazon<d5>s highly differentiated product selection is seen in their mission to deliver a wide range of products at the lowest possible cost.
## 
## 2.) What<d5>s interesting about the way Amazon segments is that their basis is not on demographic, age, or consumer-type. Instead, the company segments based on purchasing habits. Amazon<d5>s aim is to create a <d2>target of one<d3><d0>meaning a micro-segment containing a single individual. Companies such as Pandora and Netflix are also well known for this approach, which tailors the user<d5>s experience to their own preference. These companies use two crafty ways to accomplish this micro-segmentation; the first is individual leverage. As a user returns to Amazon more frequently, the company <d2>learns<d3> more about the customer and their purchasing habits. From these habits, Amazon is able to form patterns and trends, allowing them to in turn target the customer even more directly. The other method is group leverage, which consists of using similarities between customers<d5> purchasing habits so as to recommend and cross-sell other products with high success rates. The more consumers that use Amazon, the more accurate Amazon<d5>s recommendations become. Micro segmenting for Amazon is dynamic, self-perpetuating and incurs low costs.
## 
## 3.) Amazon<d5>s main distinctive competency is the buying experience they are able to offer the consumer, but as of late Amazon is also becoming an increasingly powerful content provider. Both quality and customer responsiveness are ultimately the driving force behind these competencies.  The micro segmenting done by the company for both its physical and digital products, allows Amazon to consistently deliver a quality experience that is also tailored to the consumer<d5>s interests. Also in the same vein of customer responsiveness is Amazon<d5>s logistical prowess.  Amazon is able to quickly and dependably deliver a product to a consumer through their warehouse and order handling efficiency. Certainly, this building block shares some overlap with efficiency, but their efficiency is ultimately catered towards providing the customer<d5>s experience and responding to their needs.
## 
## 4.) The generic business model pursued by Amazon is that of cost leadership<d0> which is concerned with  organizing all resources towards providing products and services at the lowest cost possible. By having lower costs associated with making, selling, and moving products, Amazon has enabled itself to sell their products at a lower price. A cost leader such as Amazon is capable of withstanding price wars with competitors due to their lower cost structure.  The company<d5>s approach to business is perhaps best captured by Amazon CEO, Jeff Bezos himself: <d2>There are two kinds of companies: Those that work to try to charge more and those that work to charge less. We will be the second.<d3> What<d5>s important to note about a company consistently offering the lowest prices to their customers, is that a low-cost structure is central to surviving the waves of demand.
##    Another generic strategy
## 
## Amazon<d5>s motto for their approach to business has evolved with time and in response to changes in the environment. In the late 90<d5>s, only a few years after the companies inception in 1995, Amazon<d5>s model was rooted in a <d2>sell all, carry few<d3> ideology. At the time, this was a relatively easy and profitable model. Although they sold more than a million books at this time, <d2>it actually only stocked about 2,000<d3> at a time. All other book orders were simply passed on to book wholesalers or publishers for the delivery of the product. Soon however, Amazon would realize that the wholesalers they were in network with weren<d5>t all that great at delivering the books themselves. As other retailers adopted this <d2>drop-shipping<d3> approach to compete with Amazon, the company realized their model was too easily imitated and no longer ideal. Hence, what was <d2>sell all, carry few<d3> turned into <d2>sell all, carry more.<d3> Amazon<d5>s warehouses witnessed the change in perspective, as the company pursued functional strategies to supplement their recent low-cost approach. Amazon expanded its product storage capabilities and began taking on more products.  This new motto <d2>sell all, carry more<d3> was heavily dependent upon the reliable delivery and logistics that Amazon is recognized for today. Recent implementation of robots in shipping centers is further indicative of Amazon<d5>s efforts. 
## 
## Amazon took this approach one step further in 2006, when they introduced <d2>Fullfillment by Amazon<d3>.  This tactic allowed the company to effectively become a wholesaler via smaller storefronts. Sellers independent from Amazon could now make use of the company<d5>s superior logistics to deliver products and in turn, they can focus more on selling their products. Today, Amazon has expanded to more than 50 warehouses and has spent more than $15 billion in doing so. Obviously, Amazon<d5>s business approach has come full circle. It is important to note that the main figure behind Amazon<d5>s success, Jeff Bezos, fully knew that a business model could quickly become outdated. The company has reached the status it has achieved today through its adaptation and constant appraisal of the presiding model.  
## 
## 5.) d
## 
## 6.) d
## 
## 7.) d
## 
## http://www.theinnovativemanager.com/the-two-business-strategies-cost-leadership-and-benefit-leadership-and-where-michael-porter-missed-the-mark/

Corpus transformations

Symbols and other such characters are not especially relevant to our analysis of the corpus, at least for now. Furthermore, symbols such as a dash or a colon are often used to separate words; by replacing these characters with a space, we ensure that the surrounding words aren’t accidentally merged into a single string during corpus transformations. The toSpace function below uses content_transformer() to create the transformation, which is then applied to the corpus with tm_map().

Let’s go ahead and replace a few common symbols with blank space to ease our analysis.

Removing specific characters and patterns

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace,"\\|")
docs <- tm_map(docs, toSpace, "<")
docs <- tm_map(docs, toSpace, ">")

Convert to lower case

docs <- tm_map(docs, content_transformer(tolower))

This provides a more uniform appearance for word clouds as we will see shortly.

Remove numbers

#Remove Numbers
docs <- tm_map(docs, removeNumbers)

Strip white space

docs <- tm_map(docs, stripWhitespace)

Remove punctuation

docs <- tm_map(docs, removePunctuation)

Remove “stop words”

Stop words are common words, such as “the” and “and”, that occur so frequently in a language that they carry little analytical value. We will remove the 174 English stop words from our corpus below.

#List the number of stop words in the english language
length(stopwords("english"))
## [1] 174
#view all stop words for reference
stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
#Remove Stop Words
docs <- tm_map(docs, removeWords, stopwords("english"))

Remove specified words

After initially mining this corpus, I came across several words, such as “also” and “company”, that naturally occurred more frequently in my essays. Furthermore, several singular words shared similar usage with their plural counterparts (company, companies), which introduced some redundancy. I will manually remove these words from the corpus below.

docs <- tm_map(docs, removeWords, c("granger", "moch", "module","http","allow","can", "also", "will","however","companies","company", "consumers","customers","even", "although", "products", "many", "well", "may","one","much"))

Stemming

Stemming, as the name implies, reduces a word to its root (or stem) by removing inflections and suffixes such as “ed” and “es”. Although this can be extremely helpful, I found that many of the words in my essays were truncated too aggressively under the stemming rules; for instance, “customer” was shortened to “custom”. My analysis was more complete without the stemDocument option, so I left it commented out below and instead removed the plural forms of words, such as products and customers, to cut down on redundant terms.

#docs <- tm_map(docs, stemDocument)
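
If you want to preview how the stemmer would behave before deciding, SnowballC’s wordStem() can be applied directly to a small vector of sample words. The words below are just illustrative examples, not drawn from the corpus.

#Preview the Porter stemmer on a few sample words
#Note that "customer" and "customers" both collapse to "custom"
wordStem(c("customer", "customers", "shipped", "shipping", "services"))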

Create document term matrix

Now we’re ready to create a document term matrix so we can get an idea of the sparsity of the texts, identify frequently associated words, and build our word clouds.

A document term matrix, simply put, is a matrix with the documents forming the rows and the terms forming the columns. The frequency of each term’s occurrences populates the matrix cells. The DocumentTermMatrix() function creates the matrix.

dtm <- DocumentTermMatrix(docs)

dtm
## <<DocumentTermMatrix (documents: 10, terms: 2467)>>
## Non-/sparse entries: 4605/20065
## Sparsity           : 81%
## Maximal term length: 92
## Weighting          : term frequency (tf)

As we can see, the matrix, at 81% sparsity, is mostly empty space.

Nonetheless, we can still view the row and column counts using the dim() function.

dim(dtm)
## [1]   10 2467
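
If the sparsity were a concern for a later analysis, tm’s removeSparseTerms() can drop the rarest terms. The 0.4 threshold below is only an illustrative value: it keeps terms that are missing from at most 40% of the documents.

#Drop terms missing from more than 40% of documents (illustrative threshold)
dtms <- removeSparseTerms(dtm, 0.4)
dim(dtms)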

Create the term document matrix (the transpose)

tdm <- TermDocumentMatrix(docs)
tdm
## <<TermDocumentMatrix (terms: 2467, documents: 10)>>
## Non-/sparse entries: 4605/20065
## Sparsity           : 81%
## Maximal term length: 92
## Weighting          : term frequency (tf)

Explore the document term matrix

By converting the document term matrix into a regular matrix and summing the column counts, we can view the term frequencies as a vector, freq.

freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 2467

Now, by ordering the frequencies, we can look at both ends of the vector: the least frequent terms, which we can most likely remove, and the most frequent terms, which are most relevant to our analysis.

ord <- order(freq)
# Least frequent terms
freq[head(ord)]
##       abe    abroad    absent absorbing  abundant    abused 
##         1         1         1         1         1         1

With each of these words occurring only once, it’s safe to say they are not particularly important. These are just a few of the words that occurred only once; we’ll address such terms shortly using a distribution of term frequencies.

freq[tail(ord)]
##   retail delivery   online  product customer   amazon 
##       37       39       50       63       82      439

In contrast, ordering by the most frequent terms shows the words most relevant to our analysis. Obviously, because the essays are concerned entirely with Amazon and its strategic management, the most frequent term is “amazon”. It also makes sense that “customer”, “product”, “online”, and “delivery” occur frequently as well.
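
Beyond inspecting the sorted vector, tm also provides findFreqTerms() and findAssocs() for pulling frequent terms and their associations straight from the matrix. The frequency cutoff of 30 and the correlation limit of 0.85 below are arbitrary values chosen for illustration.

#Terms appearing at least 30 times across the corpus
findFreqTerms(dtm, lowfreq=30)
#Terms whose per-document counts correlate with "customer" at 0.85 or higher
findAssocs(dtm, "customer", corlimit=0.85)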

Distribution of term frequencies

#Lowest frequencies
head(table(freq), 15)
## freq
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1330  414  212  134   87   64   53   28   34   22   14    7   12    8    3
#Highest frequencies
tail(table(freq), 15)
## freq
##  20  22  24  25  26  28  29  31  35  37  39  50  63  82 439 
##   3   4   2   3   2   1   1   1   4   1   1   1   1   1   1

We can see that 1,330 words occurred only once within the corpus. As already mentioned, these are terms we are not especially interested in. The terms with the highest frequencies are much more relevant to our analysis.

Plotting term frequencies

We can also generate the frequency counts for all words in the corpus and plot the most frequent terms (those occurring more than 30 times) as a bar chart with ggplot2.

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)

subset(wf, freq > 30) %>%
  ggplot(aes(word, freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
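
Since the wordcloud package is already loaded, the same freq vector can also feed a quick word cloud. This is just a rough sketch: the seed, the minimum frequency of 25, and the Dark2 palette are illustrative choices.

#Word cloud of terms occurring at least 25 times; the seed keeps the layout reproducible
set.seed(142)
wordcloud(names(freq), freq, min.freq=25, colors=brewer.pal(6, "Dark2"))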