---
title: "Intro"
author: "Anthony Kenny"
date: "12 September 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
Text mining is the process of distilling actionable insights from text. The text mining process can be broken down into six components:

1. Problem definition and specific goals
2. Identify the text to be collected
3. Text organisation
4. Feature extraction
5. Analysis
6. Reach an insight, recommendation or output
# Approaches to text mining
There are two main approaches to text mining: semantic parsing and bag of words.
## Semantic parsing
This approach is based on word syntax, so we care about word type and order. It creates a lot of features to study: a single word, for example, can be tagged as part of a sentence, as a noun, and also as a proper noun or named entity, producing three features for that single word. This effect makes semantic parsing feature rich. To do the parsing, it follows a tree structure to break up the text.
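As a rough illustration of the extra features syntax can provide, here is a minimal sketch using qdap's part-of-speech tagger. It assumes qdap and its openNLP/Java dependencies are installed, so evaluation is turned off to keep the document knitting without them.
```{r, eval=FALSE}
# A minimal sketch: tag each word in a sentence with its part of speech.
# Assumes qdap's openNLP/Java backend is available; eval is off by default.
library(qdap)
sentence <- "Anthony mines text in R"
pos(sentence)
```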
## Bag of words
Here words are just attributes of the document. At its heart, bag of words text mining represents a way to count terms, or n-grams, across a collection of documents.
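To see what that counting amounts to, here is a minimal base R sketch (an addition beyond the exercises below) that lowercases a sentence, splits it into words, and tabulates them:
```{r}
# A bare-bones bag of words: lowercase, split on spaces, count each word
sentence <- "Coffee is great and coffee is hot"
words <- unlist(strsplit(tolower(sentence), " "))
sort(table(words), decreasing = TRUE)
```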
Manually counting words like this quickly becomes a pain. Fortunately, the qdap package offers a better alternative: the freq_terms() function returns the most frequent terms in a text (including ties), and you specify how many you want.
Let's do an exercise:
```{r}
text <- c("Text mining usually involves the process of structuring the input text. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.")
new_text <- c("DataCamp is the first online learning platform that focuses on building the best learning experience specifically for Data Science. We have offices in Boston and Belgium and to date, we trained over 250,000 (aspiring) data scientists in over 150 countries. These data science enthusiasts completed more than 9 million exercises. You can take free beginner courses, or subscribe for $25/month to get access to all premium courses.")
```
```{r}
# Load qdap
library(qdap)
# Print new_text to the console
new_text
# Find the 10 most frequent terms: term_count
term_count <- freq_terms(new_text, 10)
# Plot term_count
plot(term_count)
```
## Load some text
Text mining begins with loading some text data into R, which we'll do with the read.csv() function. By default, read.csv() treats character strings as factor levels like Male/Female. To prevent this from happening, it's very important to use the argument stringsAsFactors = FALSE.
A best practice is to examine the object you read in to make sure you know which column(s) are important. The str() function provides an efficient way of doing this. You can also count the number of documents using the nrow() function on the new object. In this example, it will tell you how many coffee tweets are in the vector.
If the data frame contains columns that are not text, you may want to make a new object using only the correct column of text (e.g. some_object$column_name).
```{r}
# Import text data
tweets <- read.csv("coffee.csv", stringsAsFactors = FALSE)
# View the structure of tweets
str(tweets)
# Print out the number of rows in tweets
nrow(tweets)
# Isolate text from tweets: coffee_tweets
coffee_tweets <- tweets$text
```
## Make the vector a VCorpus object (1)
Recall that you've loaded your text data as a vector called coffee_tweets in the last exercise. Your next step is to convert this vector containing the text data to a corpus. As you've learned in the video, a corpus is a collection of documents, but it's also important to know that in the tm domain, R recognizes it as a data type.
There are two kinds of the corpus data type: the permanent corpus, PCorpus, and the volatile corpus, VCorpus. In essence, the difference between the two has to do with how the collection of documents is stored on your computer. In this course, we will use the volatile corpus, which is held in your computer's RAM rather than saved to disk.
To make a volatile corpus, R needs to interpret each element in our vector of text, coffee_tweets, as a document. And the tm package provides what are called Source functions to do just that! In this exercise, we'll use a Source function called VectorSource() because our text data is contained in a vector. The output of this function is called a Source object. Give it a shot!
```{r}
# Load tm
library(tm)
# Make a vector source: coffee_source
coffee_source <- VectorSource(coffee_tweets)
```
## Make the vector a VCorpus object (2)
Now that we've converted our vector to a Source object, we pass it to another tm function, VCorpus(), to create our volatile corpus. Pretty straightforward, right?
The VCorpus object is a nested list, or list of lists. At each index of the VCorpus object, there is a PlainTextDocument object, which is essentially a list that contains the actual text data (content), as well as some corresponding metadata (meta). It can help to visualize a VCorpus object to conceptualize the whole thing.
For example, to examine the contents of the 15th tweet in coffee_corpus, you'd subset twice: once to specify the 15th PlainTextDocument, and again to extract the first (or content) element of that PlainTextDocument:
coffee_corpus[[15]][1]
```{r}
## coffee_source is already in your workspace
# Make a volatile corpus: coffee_corpus
coffee_corpus <- VCorpus(coffee_source)
# Print out coffee_corpus
coffee_corpus
# Print data on the 15th tweet in coffee_corpus
coffee_corpus[[15]]
# Print the content of the 15th tweet in coffee_corpus
coffee_corpus[[15]][1]
```
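As a side note (an addition beyond the original exercise), tm also ships accessor functions that read more clearly than double-bracket subsetting: content() returns a document's text and meta() returns its metadata.
```{r}
# Equivalent, more readable accessors from tm
content(coffee_corpus[[15]])  # the tweet text
meta(coffee_corpus[[15]])     # the document metadata
```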
## Make a VCorpus from a data frame
Because another common text source is a data frame, there is a Source function called DataframeSource(). The DataframeSource() function treats each row of the data frame as a complete document, so be careful not to pick up non-text data like customer IDs when sourcing documents this way.
```{r}
num <- c(1, 2, 3)
Author1 <- c("Text mining is a great time.",
             "text analysis provides insights",
             "qdap and tm are used in text mining")
Author2 <- c("R is a great language",
             "R has many uses",
             "DataCamp is cool!")
# Keep the text columns as character strings, as recommended above
example_text <- data.frame(num, Author1, Author2, stringsAsFactors = FALSE)
```
```{r}
# Print example_text to the console
example_text
# Create a DataframeSource on columns 2 and 3: df_source
df_source <- DataframeSource(example_text[,2:3])
# Convert df_source to a corpus: df_corpus
df_corpus <- VCorpus(df_source)
# Examine df_corpus
df_corpus
# Create a VectorSource on column 3: vec_source
vec_source <- VectorSource(example_text[,3])
# Convert vec_source to a corpus: vec_corpus
vec_corpus <- VCorpus(vec_source)
# Examine vec_corpus
vec_corpus
```
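One caveat worth flagging (an addition, not from the original exercise): newer versions of the tm package expect DataframeSource() to receive a data frame whose first two columns are named doc_id and text, with any remaining columns treated as metadata. If the chunk above errors on your tm version, a sketch of the newer format looks like this:
```{r, eval=FALSE}
# Sketch for tm versions that require doc_id/text columns (run only if needed)
example_df <- data.frame(doc_id = as.character(1:3),
                         text = Author1,
                         stringsAsFactors = FALSE)
df_corpus2 <- VCorpus(DataframeSource(example_df))
df_corpus2
```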
## Common cleaning functions from tm
Now that you know two ways to make a corpus, we can focus on cleaning, or preprocessing, the text. First, we'll clean a small piece of text so you can see how it works. Then we will move on to actual corpora.
In bag of words text mining, cleaning helps aggregate terms. For example, it may make sense that the words "miner", "mining" and "mine" should be considered one term. Specific preprocessing steps will vary based on the project. For example, the words used in tweets are vastly different than those used in legal documents, so the cleaning process can also be quite different.
Common preprocessing functions include:

- tolower(): Make all characters lowercase
- removePunctuation(): Remove all punctuation marks
- removeNumbers(): Remove numbers
- stripWhitespace(): Remove excess whitespace
Note that tolower() is part of base R, while the other three functions come from the tm package. Going forward, we'll load the tm and qdap packages when they are needed; each new package is loaded the first time it appears.
```{r}
# Create the object: text
text <- c("<b>She</b> woke up at 6 A.M. It\'s so early! She was only 10% awake and began drinking coffee in front of her computer.")
# All lowercase
tolower(text)
# Remove punctuation
removePunctuation(text)
# Remove numbers
removeNumbers(text)
# Remove whitespace
stripWhitespace(text)
```
## Cleaning with qdap
The qdap package offers other text cleaning functions. Each is useful in its own way and is particularly powerful when combined with the others.

- bracketX(): Remove all text within brackets (e.g. "It's (so) cool" becomes "It's cool")
- replace_number(): Replace numbers with their word equivalents (e.g. "2" becomes "two")
- replace_abbreviation(): Replace abbreviations with their full text equivalents (e.g. "Sr" becomes "Senior")
- replace_contraction(): Convert contractions back to their base words (e.g. "shouldn't" becomes "should not")
- replace_symbol(): Replace common symbols with their word equivalents (e.g. "$" becomes "dollar")
```{r}
## text is still loaded in your workspace
# Remove text within brackets
bracketX(text)
# Replace numbers with words
replace_number(text)
# Replace abbreviations
replace_abbreviation(text)
# Replace contractions
replace_contraction(text)
# Replace symbols with words
replace_symbol(text)
```
## All about stop words
Often there are words that are frequent but provide little information. So you may want to remove these so-called stop words. Some common English stop words include "I", "she'll", "the", etc. In the tm package, there are 174 stop words on this common list.
In fact, when you are doing an analysis you will likely need to add to this list. In our coffee tweet example, all tweets contain "coffee", so it's important to pull out that word in addition to the common stop words. Leaving it in doesn't add any insight and will cause it to be overemphasized in a frequency analysis.
Using the c() function allows you to add new words (separated by commas) to the stop words list. For example, the following would add "word1" and "word2" to the default list of English stop words:
all_stops <- c("word1", "word2", stopwords("en"))
Once you have a list of stop words that makes sense, you will use the removeWords() function on your text. removeWords() takes two arguments: the text object to which it's being applied and the list of words to remove.
```{r}
## text is preloaded into your workspace
# List standard English stop words
stopwords("en")
# Print text without standard stop words
removeWords(text, stopwords("en"))
# Add "coffee" and "bean" to the list: new_stops
new_stops <- c(stopwords("en"), "coffee", "bean")
# Remove stop words from text
removeWords(text, new_stops)
```
## Intro to word stemming and stem completion
Still another useful preprocessing step involves word stemming and stem completion. The tm package provides the stemDocument() function to get to a word's root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.
For example,
stemDocument(c("computational", "computers", "computation"))
returns "comput" "comput" "comput". But because "comput" isn't a real word, we want to re-complete the words so that "computational", "computers", and "computation" all refer to the same word, say "computer", in our ongoing analysis.
We can easily do this with the stemCompletion() function, which takes in a character vector and an argument for the completion dictionary. The completion dictionary can be a character vector or a Corpus object. Either way, the completion dictionary for our example would need to contain the word "computer" for all the words to refer to it.
```{r}
# Create complicate
complicate <- c("complicated", "complication", "complicatedly")
# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)
# Create the completion dictionary: comp_dict
comp_dict <- c("complicate")
# Perform stem completion: complete_text
complete_text <- stemCompletion(stem_doc, comp_dict)
# Print complete_text
complete_text
```
## Word stemming and stem completion on a sentence
Let's consider the following sentence as our document for this exercise:
"In a complicated haste, Tom rushed to fix a new complication, too complicatedly."
This sentence contains the same three forms of the word "complicate" that we saw in the previous exercise. The difference here is that even if you called stemDocument() on this sentence, it would return the sentence without stemming any words. Take a moment and try it out in the console. Be sure to include the punctuation marks.
This happens because stemDocument() treats the whole sentence as one word. In other words, our document is a character vector of length 1, instead of length n, where n is the number of words in the document. To solve this problem, we first remove the punctuation marks with the removePunctuation() function you learned a few exercises back. We then strsplit() this character vector of length 1 into length n, unlist() the result, then proceed to stem and re-complete.
Don't worry if that was confusing. Let's go through the process step by step!
```{r}
# Define the document from the sentence above: text_data
text_data <- "In a complicated haste, Tom rushed to fix a new complication, too complicatedly."
# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)
# Create character vector: n_char_vec
n_char_vec <- unlist(strsplit(rm_punc, split = ' '))
# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)
# Print stem_doc
stem_doc
# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc, comp_dict)
# Print complete_doc
complete_doc
```
## Apply preprocessing steps to a corpus
The tm package provides a special function tm_map() to apply cleaning functions to a corpus. Mapping these functions to an entire corpus makes scaling the cleaning steps very easy.
To save time (and lines of code) it's a good idea to use a custom function like the one in the chunk below, since you may be applying the same functions over multiple corpora. You can probably guess what the clean_corpus() function does: it takes one argument, corpus, applies a series of cleaning functions to it in order, then returns the result.
Notice how the tm package functions do not need content_transformer(), but base R and qdap functions do.
Be sure to test your function's results. If you want to draw out currency amounts, then removeNumbers() shouldn't be used! Plus, the order of cleaning steps makes a difference. For example, if you removeNumbers() and then replace_number(), the second function won't find anything to change! Check, check, and re-check!
```{r}
# Custom cleaning function applied to a whole corpus
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  # Base R functions like tolower() must be wrapped in content_transformer()
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee", "mug"))
  return(corpus)
}
# Apply the customized function to coffee_corpus: clean_corp
clean_corp <- clean_corpus(coffee_corpus)
# Print out a cleaned up tweet
clean_corp[[227]][1]
# Print out the same tweet in original form
tweets$text[227]
```
# TDM and DTM
## Make a document-term matrix
Hopefully you are not too tired after all this basic text mining work! Just in case, let's revisit the coffee tweets to build a document-term matrix.
Beginning with the coffee.csv file, we have used common transformations to produce a clean corpus called clean_corp.
The document-term matrix is used when you want each document represented as a row. This can be useful if you are comparing authors row by row, or if the data is arranged chronologically and you want to preserve the time series.
```{r}
# Create the dtm from the corpus: coffee_dtm
coffee_dtm <- DocumentTermMatrix(clean_corp)
# Print out coffee_dtm data
coffee_dtm
# Convert coffee_dtm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_dtm)
# Print the dimensions of coffee_m
dim(coffee_m)
# Review a portion of the matrix
coffee_m[148:150,2587:2590]
```
## Make a term-document matrix
You're almost done with the not-so-exciting foundational work before we get to some fun visualizations and analyses based on the concepts you've learned so far!
In this exercise, you are performing a similar process, but taking the transpose of the document-term matrix: the term-document matrix has terms as rows and documents as columns.
The TDM is often the matrix used for language analysis, because you likely have more terms than authors or documents, and life is generally easier when you have more rows than columns. An easy way to start analyzing the information is to convert the TDM into a simple matrix using as.matrix().
```{r}
# Create a TDM from clean_corp: coffee_tdm
coffee_tdm <- TermDocumentMatrix(clean_corp)
# Print coffee_tdm data
coffee_tdm
# Convert coffee_tdm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)
# Print the dimensions of the matrix
dim(coffee_m)
# Review a portion of the matrix
coffee_m[2587:2590, 148:150]
```
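As a quick illustration of why the TDM shape is convenient (an addition beyond the original exercise), summing each row of the matrix gives overall term frequencies across all tweets:
```{r}
# Row sums of the TDM count how often each term appears across all documents
term_frequency <- rowSums(coffee_m)
# Peek at the ten most frequent terms
head(sort(term_frequency, decreasing = TRUE), 10)
```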