---
title: "3. More techniques"
author: "Anthony Kenny"
date: "13 September 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
In this section, we cover some miscellaneous (yet very important) text mining subjects, including:

* TDM/DTM weighting
* Dealing with TDM/DTM sparsity
* Capturing metadata
* Simple word clustering for topics
* Analysis on more than one word
## Distance matrix and dendrogram
A simple way to do word cluster analysis is with a dendrogram on your term-document matrix. Once you have a TDM, you can call dist() to compute the distances between the rows of the matrix. Next, you call hclust() to perform cluster analysis on the dissimilarities in the distance matrix. Lastly, you can visualize the word frequency distances using a dendrogram and plot(). Often in text mining, a dendrogram lets you tease out interesting insights or word clusters.

Consider a table of annual rainfall for three cities. Cleveland and Portland have the same amount of rainfall, so their distance is 0. You might expect those two cities to form a cluster and New Orleans to sit on its own, since it gets vastly more rain.
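The rainfall table itself isn't shipped with this file, so here is a small stand-in with hypothetical values that match the description (Cleveland and Portland tie, New Orleans gets far more); swap in the course data if you have it.
```{r}
# Hypothetical annual rainfall in inches (values invented for illustration)
rain <- data.frame(
  city = c("Cleveland", "Portland", "New Orleans"),
  rainfall = c(39.1, 39.1, 62.7)
)
```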
```{r}
# Create dist_rain
dist_rain <- dist(rain[,2])
# View the distance matrix
dist_rain
# Create hc
hc <- hclust(dist_rain)
# Plot hc
plot(hc, labels = rain$city)
```
## Make a distance matrix and dendrogram from a TDM
Now that you understand the steps in making a dendrogram, you can apply them to text. But first, you have to limit the number of words in your TDM using removeSparseTerms() from tm. Why would you want to adjust the sparsity of the TDM/DTM?

TDMs and DTMs are sparse, meaning they contain mostly zeros. Remember that 1000 tweets can become a TDM with over 3000 terms! You won't be able to easily interpret a dendrogram that cluttered, especially if you are working with more text.

A TDM that works well for a dendrogram has roughly 25 to 70 terms. The sparse argument is a cutoff on each term's proportion of zeros across documents: terms sparser than the cutoff are dropped, so the closer sparse is to 1, the more terms are kept.
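The chardonnay and coffee tweets are preloaded on DataCamp but are not included in this file. The stand-in below (a few invented tweets, with column names matching the later exercises) lets the remaining chunks run end to end; note that hard-coded indices later on, such as bi_words[2577:2587], assume the full course data.
```{r}
library(tm)

# Invented stand-in for the course's tweet data; the real exercises use
# thousands of tweets
tweets <- data.frame(
  num = 1:3,
  text = c("marvin gaye and a glass of chardonnay",
           "chardonnay pairs well with marvin gaye records",
           "venti coffee and chardonnay make a strange pairing"),
  screenName = c("user1", "user2", "user3"),
  created = as.character(Sys.time()),
  stringsAsFactors = FALSE
)
text_corp <- VCorpus(VectorSource(tweets$text))
tweets_tdm <- TermDocumentMatrix(text_corp)
```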
```{r}
# Print the dimensions of tweets_tdm
dim(tweets_tdm)
# Create tdm1
tdm1 <- removeSparseTerms(tweets_tdm, sparse = 0.95)
# Create tdm2
tdm2 <- removeSparseTerms(tweets_tdm, sparse = 0.975)
# Print tdm1
tdm1
# Print tdm2
tdm2
```
## Put it all together: a text-based dendrogram
It's time to put your skills to work and make your first text-based dendrogram. Remember, dendrograms reduce information to help you make sense of the data, much like how an average tells you something, but not everything, about a population. Both can be misleading. With text, there are often a lot of nonsensical clusters, but some valuable clusters may also appear.
A peculiarity of TDM and DTM objects is that you have to convert them first to matrices (with as.matrix()), then to data frames (with as.data.frame()), before using them with the dist() function.
For the chardonnay tweets, you may have been surprised to see the soul music legend Marvin Gaye appear in the word cloud. Let's see if the dendrogram picks up the same.
```{r}
# Create tweets_tdm2
tweets_tdm2 <- removeSparseTerms(tweets_tdm, sparse = 0.975)
# Create tdm_m
tdm_m <- as.matrix(tweets_tdm2)
# Create tdm_df
tdm_df <- as.data.frame(tdm_m)
# Create tweets_dist
tweets_dist <- dist(tdm_df)
# Create hc
hc <- hclust(tweets_dist)
# Plot the dendrogram
plot(hc)
```
## Dendrogram aesthetics
So you made a dendrogram... but it's not as eye-catching as you had hoped!
The dendextend package can help your audience by coloring branches and outlining clusters. dendextend is designed to operate on dendrogram objects, so you'll have to change the hierarchical cluster from hclust using as.dendrogram().
A good way to review the terms in your dendrogram is with the labels() function. It will print all terms of the dendrogram. To highlight specific branches, use branches_attr_by_labels(). First, pass in the dendrogram object, then a vector of terms as in c("data", "camp"). Lastly add a color such as "blue".
After you make your plot, you can call out clusters with rect.dendrogram(). This adds rectangles for each cluster. The first argument to rect.dendrogram() is the dendrogram, followed by the number of clusters (k). You can also pass a border argument specifying what color you want the rectangles to be (e.g. "green").
```{r}
# Load dendextend
library(dendextend)
# Create hc
hc <- hclust(tweets_dist)
# Create hcd
hcd <- as.dendrogram(hc)
# Print the labels in hcd
labels(hcd)
# Change the branch color to red for "marvin" and "gaye"
hcd <- branches_attr_by_labels(hcd, c("marvin", "gaye"), "red")
# Plot hcd
plot(hcd, main = "Better Dendrogram")
# Add cluster rectangles
rect.dendrogram(hcd, k = 2, border = "grey50")
```
## Using word association
Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together, while a score of 0 means that they never appear together.
To use findAssocs(), pass in a TDM or DTM, the search term, and a minimum correlation. The function returns a list of all other terms that meet or exceed the minimum threshold:

    findAssocs(tdm, "word", 0.25)
Minimum correlation values are often relatively low because of word diversity. Don't be surprised if 0.10 demonstrates a strong pairwise term association.
The coffee tweets have been cleaned and organized into tweets_tdm for the exercise. You will search for a term association, reshape the results with list_vect2df() from qdap, and then create a plot with the ggplot2 code below.
```{r}
# list_vect2df() comes from qdap; theme_gdocs() comes from ggthemes
library(qdap)
library(ggplot2)
library(ggthemes)

# Create associations
associations <- findAssocs(tweets_tdm, "venti", 0.2)
# View the venti associations
associations
# Create associations_df
associations_df <- list_vect2df(associations)[,2:3]
# Plot the associations_df values (don't change this)
ggplot(associations_df, aes(y = associations_df[, 1])) +
geom_point(aes(x = associations_df[, 2]),
data = associations_df, size = 3) +
theme_gdocs()
```
## Changing n-grams
So far, we have only made TDMs and DTMs using single words. The default is to make them with unigrams, but you can also focus on tokens containing two or more words. This can help extract useful phrases that lead to additional insights or provide improved predictive attributes for a machine learning algorithm.

The function below uses the RWeka package to create trigram (three-word) tokens: min and max are both set to 3.

    tokenizer <- function(x)
      NGramTokenizer(x, Weka_control(min = 3, max = 3))
Then the customized tokenizer() function can be passed to TermDocumentMatrix() or DocumentTermMatrix() via the control list:

    tdm <- TermDocumentMatrix(
      corpus,
      control = list(tokenize = tokenizer)
    )
```{r}
# NGramTokenizer() and Weka_control() come from RWeka
library(RWeka)

# Make tokenizer function
tokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Create unigram_dtm
unigram_dtm <- DocumentTermMatrix(text_corp)
# Create bigram_dtm
bigram_dtm <- DocumentTermMatrix(
text_corp,
control = list(tokenize = tokenizer)
)
# Examine unigram_dtm
unigram_dtm
# Examine bigram_dtm
bigram_dtm
```
## How do bigrams affect word clouds?
Now that you have made a bigram DTM, you can examine it and remake a word cloud. The new tokenization method affects not only the matrices, but also any visuals or models built on top of them.

Remember how "Marvin" and "Gaye" were separate and large terms in the chardonnay word cloud? Bigram tokenization grabs all two-word combinations. Observe what happens to the word cloud in this exercise.
```{r}
# wordcloud() comes from the wordcloud package
library(wordcloud)

# Create bigram_dtm_m
bigram_dtm_m <- as.matrix(bigram_dtm)
# Create freq
freq <- colSums(bigram_dtm_m)
# Create bi_words
bi_words <- names(freq)
# Examine part of bi_words
bi_words[2577:2587]
# Plot a wordcloud
wordcloud(bi_words, freq,
max.words = 15)
```
## Changing frequency weights
So far you have used simple term frequency to make the DocumentTermMatrix or TermDocumentMatrix. Other term weights can be helpful; the most popular is TfIdf, which stands for term frequency-inverse document frequency.

A term's TfIdf score increases with how often it occurs in a document, but is penalized as the term appears in more of the documents. From a common-sense perspective, if a term appears often it must be important; this attribute is represented by term frequency (Tf), which is normalized by the length of the document. However, if a term appears in all documents, it is not likely to be insightful; this is captured by the inverse document frequency (Idf).

The Wikipedia page on TfIdf covers the mathematics behind the score; the exercise demonstrates the practical difference.
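As a sketch of the usual formula (tm's weightTfIdf uses a base-2 logarithm and, by default, normalizes counts by document length):

$$\mathrm{TfIdf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \times \log_2\frac{N}{n_t}$$

where $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$.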
```{r}
# Create tf_tdm
tf_tdm <- TermDocumentMatrix(text_corp)
# Create tfidf_tdm
tfidf_tdm <- TermDocumentMatrix(text_corp,
                                control = list(weighting = weightTfIdf))
# Create tf_tdm_m
tf_tdm_m <- as.matrix(tf_tdm)
# Create tfidf_tdm_m
tfidf_tdm_m <- as.matrix(tfidf_tdm)
# Examine part of tf_tdm_m
tf_tdm_m[508:509,5:10]
# Examine part of tfidf_tdm_m
tfidf_tdm_m[508:509,5:10]
```
## Capturing metadata in tm
Depending on what you are trying to accomplish, you may want to keep metadata about the documents when you create a TDM or DTM. This metadata can be incorporated into the corpus fairly easily by creating a readerControl list and applying it to a DataframeSource when calling VCorpus().

You will need to know the column names of the data frame containing the metadata to be captured. The names() function is helpful for this.

To capture the text column of the coffee tweets along with metadata columns such as the unique id num, the author's screen name, and the creation date, you would use the code below.
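Note that clean_corpus() below is a helper defined earlier in the course, not in this file. A minimal sketch with typical tm cleaning steps (your version may differ):
```{r}
# Minimal clean_corpus() sketch; the course's helper may use other steps
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}
```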
```{r}
# Add author to custom reading list
custom_reader <- readTabular(
mapping = list(content = "text",
id = "num",
author = "screenName",
date = "created")
)
# Make corpus with custom reading
text_corpus <- VCorpus(
DataframeSource(tweets),
readerControl = list(reader = custom_reader)
)
# Clean corpus
text_corpus <- clean_corpus(text_corpus)
# Print data
text_corpus[[1]][1]
# Print metadata
text_corpus[[1]][2]
```