-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path4. Case Study.Rmd
300 lines (219 loc) · 10 KB
/
4. Case Study.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
---
title: "4. Amazon Vs. Google"
author: "Anthony Kenny"
date: "13 September 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#Step 2: Identifying the text sources
Employee reviews can come from various sources. If your human resources department had the resources, you could have a third party administer focus groups to interview employees both internally and from your competitor.
Forbes and others publish articles about the "best places to work", which may mention Amazon and Google. Another source of information might be anonymous online reviews from websites like [Indeed](http://www.indeed.com/cmp/Amazon.com/reviews), [Glassdoor](https://www.glassdoor.ie/index.htm?countryRedirect=true) or [CareerBliss.](https://www.careerbliss.com/amazon/reviews/)
Here, we'll focus on a collection of anonymous online reviews.
```{r}
# Print the structure of amzn
str(amzn)
# Create amzn_pros
amzn_pros <- amzn$pros
# Create amzn_cons
amzn_cons <- amzn$cons
# Print the structure of goog
str(goog)
# Create goog_pros
goog_pros <- goog$pros
# Create goog_cons
goog_cons <- goog$cons
```
#Text organization
Now that you have selected the exact text sources, you are ready to clean them up. You'll be using the two functions you just saw in the video: qdap_clean(), which applies a series of qdap functions to a text vector, and tm_clean(), which applies a series of tm functions to a corpus object. You can refer back to the video to remind yourself of how they work.
In order to keep things simple, the functions have been defined for you and are available in your workspace. It's your job to apply them to amzn_pros and amzn_cons!
```{r}
qdap_clean <- function(x){
x <- replace_abbreviation(x)
x <- replace_contraction(x)
x <- replace_number(x)
x <- replace_ordinal(x)
x <- replace_ordinal(x)
x <- replace_symbol(x)
x <- tolower(x)
return(x)
}
tm_clean <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords,
c(stopwords("en"), "Google", "Amazon", "company"))
return(corpus)
}
# Alter amzn_pros
amzn_pros <- qdap_clean(amzn_pros)
# Alter amzn_cons
amzn_cons <- qdap_clean(amzn_cons)
# Create az_p_corp
az_p_corp <- VCorpus(VectorSource(amzn_pros))
# Create az_c_corp
az_c_corp <- VCorpus(VectorSource(amzn_cons))
# Create amzn_pros_corp
amzn_pros_corp <- tm_clean(az_p_corp)
# Create amzn_cons_corp
amzn_cons_corp <- tm_clean(az_c_corp)
```
##Working with Google reviews
Now that the Amazon reviews have been cleaned, the same must be done for the Google reviews. qdap_clean() and tm_clean() are available in your workspace to help you clean goog_pros and goog_cons.
```{r}
# Apply qdap_clean to goog_pros
goog_pros <- qdap_clean(goog_pros)
# Apply qdap_clean to goog_cons
goog_cons <- qdap_clean(goog_cons)
# Create goog_p_corp
goog_p_corp <- VCorpus(VectorSource(goog_pros))
# Create goog_c_corp
goog_c_corp <- VCorpus(VectorSource(goog_cons))
# Create goog_pros_corp
goog_pros_corp <- tm_clean(goog_p_corp)
# Create goog_cons_corp
goog_cons_corp <- tm_clean(goog_c_corp)
```
##Feature extraction & analysis: amzn_pros
amzn_pros_corp, amzn_cons_corp, goog_pros_corp and goog_cons_corp have all been preprocessed, so now you can extract the features you want to examine. Since you are using the bag of words approach, you decide to create a bigram TermDocumentMatrix for Amazon's positive reviews corpus, amzn_pros_corp. From this, you can quickly create a wordcloud() to understand what phrases people positively associate with working at Amazon.
The function below uses RWeka to tokenize two terms and is used behind the scenes in this exercise.
tokenizer <- function(x)
NGramTokenizer(x, Weka_control(min = 2, max = 2))
```{r}
# Create amzn_p_tdm
amzn_p_tdm <- TermDocumentMatrix(
amzn_pros_corp,
control = list(tokenize = tokenizer)
)
# Create amzn_p_tdm_m
amzn_p_tdm_m <- as.matrix(amzn_p_tdm)
# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_tdm_m)
# Plot a wordcloud using amzn_p_freq values
wordcloud(names(amzn_p_freq), amzn_p_freq,
max.words = 25, color = "blue")
```
##Feature extraction & analysis: amzn_cons
You now decide to contrast this with the amzn_cons_corp corpus in another bigram TDM. Of course, you expect to see some different phrases in your word cloud.
Once again, you will use this custom function to extract your bigram features for the visual:
tokenizer <- function(x)
NGramTokenizer(x, Weka_control(min = 2, max = 2))
```{r}
# Create amzn_c_tdm
amzn_c_tdm <- TermDocumentMatrix(
amzn_cons_corp,
control = list(tokenize = tokenizer)
)
# Create amzn_c_tdm_m
amzn_c_tdm_m <- as.matrix(amzn_c_tdm)
# Create amzn_c_freq
amzn_c_freq <- rowSums(amzn_c_tdm_m)
# Plot a wordcloud of negative Amazon bigrams
wordcloud(names(amzn_c_freq), amzn_c_freq,
max.words = 25, colors = "red")
```
##amzn_cons dendrogram
It seems there is a strong indication of long working hours and poor work-life balance in the reviews. As a simple clustering technique, you decide to perform a hierarchical cluster and create a dendrogram to see how connected these phrases are.
```{r}
# Create amzn_c_tdm
amzn_c_tdm <- TermDocumentMatrix(
amzn_cons_corp,
control = list(tokenize = tokenizer)
)
# Print amzn_c_tdm to the console
amzn_c_tdm
# Create amzn_c_tdm2 by removing sparse terms
amzn_c_tdm2 <- removeSparseTerms(amzn_c_tdm, .993)
# Create hc as a cluster of distance values
hc <- hclust(dist(amzn_c_tdm2, method = "euclidean"),
method = "complete")
# Produce a plot of hc
plot(hc)
```
##Word association
As expected, you see similar topics throughout the dendrogram. Switching back to positive comments, you decide to examine top phrases that appeared in the word clouds. You hope to find associated terms using the findAssocs()function from tm. You want to check for something surprising now that you have learned of long hours and a lack of work-life balance.
```{r}
# Create amzn_p_tdm
amzn_p_tdm <- TermDocumentMatrix(
amzn_pros_corp,
control = list(tokenize = tokenizer)
)
# Create amzn_p_m
amzn_p_m <- as.matrix(amzn_p_tdm)
# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_m)
# Create term_frequency
term_frequency <- sort(amzn_p_freq, decreasing = TRUE)
# Print the 5 most common terms
term_frequency[1:5]
# Find associations with fast paced
findAssocs(amzn_p_tdm, "fast paced", 0.2)
```
Quick review of Google reviews
100xp
You decide to create a comparison.cloud() of Google's positive and negative reviews for comparison to Amazon. This will give you a quick understanding of top terms without having to spend as much time as you did examining the Amazon reviews in the previous exercises.
We've provided you with a corpus all_goog_corpus, which has the 500 positive and 500 negative reviews for Google. Here, you'll clean the corpus and create a comparison cloud comparing the common words in both pro and con reviews.
```{r}
# Create all_goog_corp
all_goog_corp <- tm_clean(all_goog_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_goog_corp)
# Name the columns of all_tdm
colnames(all_tdm) <- c("Goog_Pros", "Goog_Cons")
# Create all_m
all_m <- as.matrix(all_tdm)
# Build a comparison cloud
comparison.cloud(all_m,
colors = c("#F44336", "#2196f3"),
max.words = 100)
```
Cage match! Amazon vs. Google pro reviews
100xp
Amazon's positive reviews appear to mention bigrams such as "good benefits", while its negative reviews focus on bigrams such as "work load" and "work-life balance" issues.
In contrast, Google's positive reviews mention "great food", "perks", "smart people", and "fun culture", among other things. Google's negative reviews discuss "politics", "getting big", "bureaucracy", and "middle management".
You decide to make a pyramid plot lining up positive reviews for Amazon and Google so you can adequately see the differences between any shared birgrams.
```{r}
# Create common_words
common_words <- subset(all_tdm_m, all_tdm_m[, 1] > 0 & all_tdm_m[, 2] > 0)
# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])
# Add difference to common_words
common_words <- cbind(common_words, difference)
# Order the data frame from most differences to least
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]
# Create top15_df
top15_df <- data.frame(x = common_words[1:15, 1],
y = common_words[1:15, 2],
labels = rownames(common_words[1:15, ]))
# Create the pyramid plot
pyramid.plot(top15_df$x, top15_df$y,
labels = top15_df$labels, gap = 12,
top.labels = c("Amzn", "Pro Words", "Google"),
main = "Words in Common", unit = NULL)
```
Cage match, part 2! Negative reviews
100xp
Interestingly, some Amazon employees discussed "work-life balance" as a positive. In both organizations, people mentioned "culture" and "smart people", so there are some similar positive aspects between the two companies.
You now decide to turn your attention to negative reviews and make the same visual. This time, all_tdm_m contains the negative reviews, or cons, from both organizations.
```{r}
# Create common_words
common_words <- subset(all_tdm_m, all_tdm_m[, 1] > 0 & all_tdm_m[, 2] > 0)
# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])
# Bind difference to common_words
common_words <- cbind(common_words, difference)
# Order the data frame from most differences to least
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]
# Create top15_df
top15_df <- data.frame(x = common_words[1:15, 1],
y = common_words[1:15, 2],
labels = rownames(common_words[1:15, ]))
# Create the pyramid plot
pyramid.plot(top15_df$x, top15_df$y,
labels = top15_df$labels, gap = 12,
top.labels = c("Amzn", "Cons Words", "Google"),
main = "Words in Common", unit = NULL)
```
Based on the visual, does Amazon or Google have a better work-life balance according to current employee reviews?
Google.