A quick analysis of Baltimore crime
========================================================
```{r setup, echo=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
I'm going to do a very simple analysis of Baltimore crime to show off R. We'll use data downloaded from Baltimore City's awesome open data site (this was downloaded a couple of years ago so if you download now, you will get different results).
### Getting data
* Arrest data: https://data.baltimorecity.gov/Crime/BPD-Arrests/3i3v-ibrt
* CCTV data: https://data.baltimorecity.gov/Crime/CCTV-Locations/hdyb-27ak
Let's load the data:
```{r}
arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
cctv_tab=read.csv("CCTV_Locations.csv", stringsAsFactors=FALSE)
# these columns are mislabeled, so fix them
tmp=arrest_tab$sex
arrest_tab$sex=arrest_tab$race
arrest_tab$race=tmp
```
### Exploring data
```{r}
# dimension of table (data.frame)
dim(arrest_tab)
# what are the columns
names(arrest_tab)
# what is the average arrest age?
mean(arrest_tab$age)
# the range of arrest ages
range(arrest_tab$age)
# how many arrests per sex
table(arrest_tab$sex)
# what are the most common offenses
head(sort(table(arrest_tab$incidentOffense),decreasing=TRUE))
# what are the offenses that only happen once
tab <- table(arrest_tab$incidentOffense)
tab[tab == 1]
# range of arrests after removing those w/ age==0
range(arrest_tab$age[arrest_tab$age>0])
```
Offenses by sex
```{r}
tab <- table(arrest_tab$incidentOffense, arrest_tab$sex)
```
Let's see a table of arrests by sex and race
```{r}
table(sex=arrest_tab$sex,race=arrest_tab$race)
```
A histogram of age
```{r}
hist(arrest_tab$age,nc=100)
with(arrest_tab,hist(age[sex=="M"],nc=100)) # males only
with(arrest_tab,hist(age[sex=="F"],nc=100)) # females only
```
### Are males and females arrested at different ages on average?
Let's take a look at how age depends on sex. Let's plot age as a function of sex first (notice how we indicate that sex is a `factor`).
```{r}
plot(arrest_tab$age~factor(arrest_tab$sex))
```
One of the neat things about R is that statistical model building and testing is built-in. The model we use is $y_i=\beta_0+\beta_1 x_i$ where $y_i$ is age of sample (example) $i$ and $x_i$ is an indicator variable $x_i \in \{0,1\}$ with $x_i=1$ if the $i$-th record (example) is male. You can check that $\beta_1$ is the difference in mean age between females and males.
We use the formula syntax to build a linear regression model.
```{r}
# let's ignore those records with missing sex
fit=lm(age~factor(sex),data=arrest_tab,subset=arrest_tab$sex %in% c("M","F"))
summary(fit)
```
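As a sanity check on the claim that $\beta_1$ is the difference in group means, here is a self-contained sketch on synthetic data (not the arrest records):

```{r beta1_check}
# Synthetic data: two groups whose true means differ by 3
set.seed(2)
toy <- data.frame(age = c(rnorm(100, mean = 30), rnorm(100, mean = 33)),
                  sex = rep(c("F", "M"), each = 100))
fit_toy <- lm(age ~ factor(sex), data = toy)
# For a two-level factor, the slope equals the difference in sample means exactly
b1 <- unname(coef(fit_toy)[2])
mean_diff <- mean(toy$age[toy$sex == "M"]) - mean(toy$age[toy$sex == "F"])
all.equal(b1, mean_diff)  # TRUE
```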
We see that $\beta_1 \approx -0.2$ meaning that the arrest age for males is about 2.5 months younger. So there is very little difference in the average age (which is what the linear model is testing) but we see that the probability of observing this difference from a sample of this size **when there is no difference in average age** is small $p \approx 0.01$. Since we have a very large number of examples, or records, this testing framework will declare very small differences as *statistically significant*. We'll return to this theme later in class.
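The effect of sample size on significance can be sketched with a quick simulation (synthetic data, not the arrest records): the same small difference in means produces a far smaller p-value once the sample is large.

```{r size_vs_significance}
set.seed(1)
# Two groups with a small true difference in means (0.2 sd units)
p_small <- t.test(rnorm(50, 0),    rnorm(50, 0.2))$p.value
p_large <- t.test(rnorm(50000, 0), rnorm(50000, 0.2))$p.value
c(p_small = p_small, p_large = p_large)
```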
### Geographic distribution of arrests.
First we need to extract latitude and longitude from the location field. We'll use some string functions to do this:
```{r}
tmp=gsub("\\)","",gsub("\\(","",arrest_tab$Location))
tmp=strsplit(tmp,split=",")
arrest_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
arrest_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
```
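The same parsing steps can be traced on a single hypothetical location string:

```{r parse_example}
loc <- "(39.2904, -76.6122)"  # hypothetical "(lat, lon)" string
tmp <- gsub("\\)", "", gsub("\\(", "", loc))
tmp <- strsplit(tmp, split = ",")
lat <- as.numeric(sapply(tmp, function(x) x[1]))
lon <- as.numeric(sapply(tmp, function(x) x[2]))
c(lat, lon)  # 39.2904 -76.6122
```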
Now let's plot
```{r}
plot(arrest_tab$lon, arrest_tab$lat, xlab="Longitude", ylab="Latitude", main="Arrests in Baltimore")
```
We can also use density estimates to make this nicer:
```{r}
smoothScatter(arrest_tab$lon, arrest_tab$lat, xlab="Longitude", ylab="Latitude", main="Arrests in Baltimore")
```
Let's make this fancier using the `ggplot2` graphics systems and the `maps` package containing map data.
```{r}
library(maps)
library(ggplot2)
balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
print(plt)
```
Now let's add CCTV cameras.
```{r}
tmp=gsub("\\)","",gsub("\\(","",cctv_tab$Location))
tmp=strsplit(tmp,split=",")
cctv_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
cctv_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
plt=plt+geom_point(data=cctv_tab,aes(x=lon,y=lat),color="red")
print(plt)
```
### A challenge
Is there any relationship between the number of CCTV cameras and the number of arrests? Divide the city into a grid and plot the number of CCTV cameras vs. the number of arrests.
```{r}
# step 1: divide city into grid for arrest data
# step 1a: find the range of latitude and longitude
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)
# step 1b: discretize latitude into 50 bins
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)
# now longitude
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)
# step 1c: make a factor indicating geographic grid location
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))
# step 2: do the same for the cctv data
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))
arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
m=match(names(cctvTab),names(arrestTab))
plot(arrestTab[m]~factor(cctvTab),xlab="Number of CCTV cameras", ylab="Number of Arrests")
```
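The binning idea above — `cut` each axis, then `paste` the two factors into grid-cell labels — can be sketched on synthetic points:

```{r grid_sketch}
set.seed(3)
x <- runif(1000); y <- runif(1000)          # synthetic points in the unit square
xf <- cut(x, breaks = seq(0, 1, len = 5))   # 4 bins per axis
yf <- cut(y, breaks = seq(0, 1, len = 5))
cellCounts <- table(factor(paste(xf, yf, sep = ":")))
length(cellCounts)  # number of occupied grid cells (at most 16)
sum(cellCounts)     # every point lands in exactly one cell: 1000
```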
### Extra analyses
As part of Project 1 you will add to this analysis. Please use the following template:
#### Mihai Sirbu
What question are you asking?:
I am trying to answer: at what time of day are most people arrested? For this preliminary analysis, I plan on making a plot with hour on the x-axis and the number of arrests on the y-axis. This will produce an "Arrest Timeseries".
What is the code you use to answer it?:
```{r surbu}
time <- strptime(arrest_tab$arrestTime, "%H:%M")
arrest_tab$hours <- as.numeric(format(time, "%H"))
hours_df <- as.data.frame(table(arrest_tab$hours))
names(hours_df) <- c("hour","count")
g <- ggplot(hours_df, aes(hour, count, group=1))+geom_line(color="blue")+geom_point(color="blue")
g <- g+labs(title = "Arrest Timeseries", x="Time of Day",y="Num of Arrests")
g <- g+scale_x_discrete(breaks=seq(0,23,2))
g <- g + theme(plot.title=element_text(size=16,face="bold"),axis.title.x=element_text(size=16,face="bold"),axis.title.y=element_text(size=16,face="bold"))
g
```
What did you observe?
I had originally thought that there would be very few arrests until 8 pm, at which point there would be a giant spike from 8 pm to 5 am. But that was not the case. Instead, the two biggest hours for arrests were 6 pm followed by 10 am (!!). At this point, I'm not entirely sure why that might be. I would be surprised, however, if all offenses followed this exact same pattern.
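The hour-extraction step used above can be checked on a few hypothetical times:

```{r hour_check}
times  <- c("08:15", "18:45", "23:05")  # hypothetical arrestTime values
parsed <- strptime(times, "%H:%M")
hours  <- as.numeric(format(parsed, "%H"))
hours  # 8 18 23
```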
#### Aaron Dugatkin
What question are you asking?: I am trying to find out how cameras affect the sorts of crimes in their area, both in reducing certain types of crime, or leading to finding more of other types of crime.
What is the code you use to answer it?:
```{r aarondugatkin}
# modified code from above, to create factors, but remove NA
# added by HCB to restore original arrest table
arrest_tab_original = arrest_tab
#
arrest_tab = arrest_tab[!is.na(arrest_tab$lat) & !is.na(arrest_tab$lon),]
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))
arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
#count crimes in areas with and without camera
arrestOnCamera = gridFactor %in% names(cctvTab)
count_crime_tab <- table(arrest_tab$incidentOffense, arrestOnCamera)
#merge the two tables, and calculate the difference in crime frequency in the two situations
crime_tab <- data.frame(count_crime_tab[,1], count_crime_tab[,2])
colnames(crime_tab)[1] <- "noCamCrimes"
colnames(crime_tab)[2] <- "camCrimes"
crime_tab$names <- rownames(crime_tab)
crime_tab$campct <- crime_tab$camCrimes/sum(crime_tab$camCrimes)*100
crime_tab$nocampct <- crime_tab$noCamCrimes/sum(crime_tab$noCamCrimes)*100
crime_tab$pctchange <- crime_tab$campct - crime_tab$nocampct
#display the change in crime frequency with crime name in descending order, with the most increased (caught) crimes first
crime_tab <- crime_tab[with(crime_tab, order(-pctchange)), ]
options(scipen=999)
subset(crime_tab, select=c("pctchange"))
# added by HCB to restore original arrest table
arrest_tab = arrest_tab_original
```
What did you observe? The results were interesting. We see a large increase in charges of narcotics, which may be due to camera surveillance. We also see a decrease in assault, which may be due to perpetrators realizing the danger of committing such crimes in front of a camera. However, the vast majority of crimes do not even see a 1% change between the two situations, so it would appear that, overall, cameras do not have a major effect on criminal activity.
#### Anna Petrone
What question are you asking?:
Which neighborhoods in Baltimore have the highest number of arrests?
What is the code you use to answer it?:
Load libraries
```{r AnnaPetrone:libs}
library(rgdal) # needed for reading shape files
library(plyr) # needed for rename function
library(sp) # needed for point.in.polygon function
library(ggmap) # could use for geocoding addresses
library(ggplot2) # needed for plotting
```
Find the number of arrests for which the geo-coordinates weren't given
```{r AnnaPetrone:nogeo}
no.geo.idx = nchar(arrest_tab$Location.1) == 0
n.geo.missing = sum( no.geo.idx )
narrests = dim(arrest_tab)[1]
n.geo.missing/narrests*100 # 39%
```
Find the number of incidents that don't have geocode info but do have an incidentLocation
```{r AnnaPetrone:nogeo_butloc}
has.location = nchar(arrest_tab$incidentLocation) > 0
sum(no.geo.idx & has.location)
```
```{r AnnaPetrone:geocode}
#tmp = paste(arrest_tab$incidentLocation[no.geo.idx & has.location], "Baltimore, MD")
#gc = geocode(tmp[1:2490]) # restricted to 2500 api requests per day
```
Get the 2010 statistical community areas [here](http://bniajfi.org/mapping-resources/)
Download the shape files and extract from the .zip file
```{r AnnaPetrone:getshape}
setwd("csa_2010_boundaries/")
csa = readOGR(dsn=".",layer="CSA_NSA_Tracts")
csa.df = fortify(csa) # fortify turns the shape data into a data.frame
csa.df = rename(csa.df, c("long"="X.feet","lat"="Y.feet")) # MD uses State Plane coords instead of lat/lon (see comments section)
convert = FALSE
if (convert){ # write a file to send to matlab code (described in comments section)
write.csv(csa.df[,c("lat","lon")], "csa-df.txt", quote=FALSE, na="",row.names=FALSE)
}
csa.converted.df = read.csv("csa-df_converted.txt",header=FALSE) # output of the matlab code, converted to lat/lon
setwd("..")
csa.converted.df = rename(csa.converted.df, c("V1"="lat","V2"="lon"))
csa.df = cbind(csa.df, csa.converted.df)
```
Now assign each of the arrest records to a neighborhood. This is only possible for the records that have geo info. This step takes about 15-20 seconds.
```{r AnnaPetrone:assign}
ncsa = dim(csa)[1]
arrest_tab_geo = arrest_tab[!no.geo.idx,]
narrests.geo = dim(arrest_tab_geo)[1]
arrest_nbhd_id = vector(length = narrests.geo)
for (j in 1:ncsa) { # takes about 30 sec
idx = csa.df$id == j-1
polyx = csa.df$lon[idx]
polyy = csa.df$lat[idx]
in.poly = point.in.polygon(arrest_tab_geo$lon, arrest_tab_geo$lat, polyx,polyy)
in.poly= as.logical(in.poly)
arrest_nbhd_id[in.poly] = j - 1
}
arrest_tab_geo = cbind(arrest_tab_geo, arrest_nbhd_id)
```
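The `point.in.polygon` test from `sp` can be illustrated with a base-R ray-casting sketch (this is not `sp`'s implementation, just the same idea on a unit square):

```{r pip_sketch}
# Ray casting: a point is inside if a horizontal ray from it crosses the
# polygon boundary an odd number of times
point_in_poly <- function(x, y, polyx, polyy) {
  n <- length(polyx); inside <- FALSE; j <- n
  for (i in seq_len(n)) {
    crosses <- (polyy[i] > y) != (polyy[j] > y)
    if (crosses &&
        x < (polyx[j] - polyx[i]) * (y - polyy[i]) / (polyy[j] - polyy[i]) + polyx[i]) {
      inside <- !inside
    }
    j <- i
  }
  inside
}
sq_x <- c(0, 1, 1, 0); sq_y <- c(0, 0, 1, 1)  # unit square
point_in_poly(0.5, 0.5, sq_x, sq_y)  # TRUE
point_in_poly(1.5, 0.5, sq_x, sq_y)  # FALSE
```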
For each neighborhood, count the number of arrests, using the table function
```{r AnnaPetrone:narrests}
nbhd.narrests = as.data.frame(table(arrest_nbhd_id))
nbhd.narrests = rename(nbhd.narrests, c("arrest_nbhd_id"="id", "Freq"="narrests"))
nbhd.names= as.vector(csa$Neigh)
nbhd.narrests = cbind(nbhd.names, nbhd.narrests)
head(nbhd.narrests)
```
Merge the arrest counts with the geometry data
```{r AnnaPetrone:narrests.merge}
csa.df = merge(csa.df, nbhd.narrests, by="id",all.x=TRUE)
```
Make a plot colored by number of arrests
```{r AnnaPetrone_plot}
g = ggplot(csa.df,aes(x=lon,y=lat,group=group))
g = g + geom_polygon(aes(fill=narrests)) + scale_fill_gradient(low="slategray1",high="slateblue4") # color the nbhds by narrests
g = g + geom_path(colour="gray75",size=.1) # draw lines separating the neighborhoods
g = g + ggtitle("Baltimore City Arrests 2011 - 2012") # add a title
g = g + theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(),
axis.title.x = element_blank(), axis.title.y = element_blank()) # remove axis labels
print(g)
```
What did you observe?
First, it should be noted that out of the 104,528 arrest records, 40,636 of them (about 39%) did not have geocoded locations (latitude and longitude). Some of them (7,650) did have an address in the incidentLocation field, so it would be possible to geocode these, though it would not contribute enormously. (I did take a look at the ggmap library, which can convert an address string into lat and lon by calling the Google Maps geocoding API; however, it only allows 2,500 requests per day.) Therefore my analysis only includes the 61% of records that provided geocoded information.
Second, I need to note that it was a pain converting from the [MD State Plane coordinates](http://en.wikipedia.org/wiki/State_Plane_Coordinate_System) into longitude and latitude. I ended up using [an external matlab function](http://www.mathworks.com/matlabcentral/fileexchange/26413-sp-proj) to do the conversion, since it seemed really confusing to do in R.
The results: as could probably be expected, the highest number of arrests occurred in the downtown area, with the northwest area notably high as well. The Inner Harbor neighborhood is among the lowest, which makes sense as this area is more touristy. The central northern neighborhoods are also on the low end (I don't know Baltimore, but I'm guessing these are higher-income neighborhoods).
For future analysis, it would be good to create a similar plot where the colors represent neighborhood income level. I would also like to add a layer showing the locations of transit stations, since these are commonly believed to attract crime.
#### Raul Alfaro
What question are you asking?: Which is the most common crime per race?
What is the code you use to answer it?:
```{r RaulAlfaro}
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="A"]),decreasing=TRUE))
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="B"]),decreasing=TRUE))
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="H"]),decreasing=TRUE))
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="I"]),decreasing=TRUE))
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="U"]),decreasing=TRUE))
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="W"]),decreasing=TRUE))
```
What did you observe?
I observed that, aside from the Unknown Offenses, the most common crime for all races but one was Narcotics; the one remaining race category had Common Assault as its most common crime.
#### Kim St. Andrie, Rain Surasorn
What question are you asking?:
Which year had the largest number of arrests?
What is the code you use to answer it?:
```{r KimRain}
a <- data.frame(id = arrest_tab$arrest, year = substr(arrest_tab$arrestDate,7,11))
head(sort(table(a$year), decreasing=TRUE),10)
```
What did you observe?
2011 had the largest number of arrests but there was no dramatic difference between the number of arrests for each year. There were 52,868 arrests in 2011 which was 1208 more than the number of arrests for 2012.
#### Rentao Wu
What are you asking:
I wanted to know if the ratio of female to male arrests is similar across the different races.
What is the code you used to answer it?
```{r}
mytab = table(race=arrest_tab$race, sex=arrest_tab$sex)
#mydf$V1 = NULL
mydf = as.data.frame.matrix(mytab)
mydf$ratio = mydf$F/mydf$M
mydf$ratio <- round(mydf$ratio, 3)
mydf = mydf[-1,]
mydf
mydf$race <- c("A","B","H","I","U","W")
ggplot(data=mydf, aes(x=race, y=ratio, fill=race)) + geom_bar(stat="identity")
```
What are your observations?
I found that for most races the ratio of female to male arrests is about 0.2, i.e., about 1 female for every 5 male arrests. The female to male ratio for the white population, however, is about 0.43, which is much higher than the others.
#### Krishna Pai
What question are you asking?: Do police officers go out of their way to arrest more black people than white people?
What is the code you use to answer it?:
```{r kpai}
library(ggplot2)
# added by HCB to not dirty global environment
kpai=function()
{
arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
tmp=arrest_tab$sex
arrest_tab$sex=arrest_tab$race
arrest_tab$race=tmp
police_tab=read.csv("Police_Stations.csv", stringsAsFactors=FALSE)
tmp=gsub("\\)","",gsub("\\(","",arrest_tab$Location))
tmp=strsplit(tmp,split=",")
arrest_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
arrest_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
tmp=gsub("\\)","",gsub("\\(","",police_tab$Location))
tmp=strsplit(tmp,split=",")
police_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
police_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
plt=ggplot()
plt=plt+geom_point(data=arrest_tab[arrest_tab$race=='B',],aes(x=lon,y=lat),color="black",alpha=.1)
plt=plt+geom_point(data=arrest_tab[arrest_tab$race=='W',],aes(x=lon,y=lat),color="white",alpha=.1)
plt=plt+geom_point(data=police_tab,aes(x=lon,y=lat),color="red")
print(plt)
}
kpai()
```
What did you observe? I was surprised to find that most of the arrests made some distance away from the cluster of police stations were for white people. It would be interesting to investigate what kinds of crimes might have been committed so far away from the stations, and why white people stand out as being arrested at that distance, especially in the east.
#### David Kaminsky
What question are you asking?: What is the frequency of each offense during each hour of the day?
What is the code you use to answer it?:
```{r DavidKaminsky}
results=table(substr(arrest_tab$arrestTime, 0, 2),gsub("-.*$", "", arrest_tab$incidentOffense))
# head(results, n) keeps the first n hours of the day; vary n to see more of the day:
barplot(head(results,1), las=2)
barplot(head(results,6), las=2)
barplot(head(results,12), las=2)
barplot(head(results,18), las=2)
barplot(head(results,24), las=2)
```
What did you observe?
I observed that unknown crime is most common, followed by narcotics and assault. If you look at the graphs individually at (military times) 0100, 0600, 1200, 1800, and 2400, you will see that the types of arrests most prominently shown are either unknown, narcotics, or assault.
#### 2208Jay
##### Group Members
* Praneet Puppala
* Jason Rubin
* Michael Younkin
* Haoyue (Jay) Zhang
###### What question are you asking?:
Is there a correlation between arrest location and the locations of vacant
buildings?
###### What is the code you use to answer it?:
```{r 2208Jay}
# Read in the larger arrests dataset
recentArrests <- read.csv("full_BPD_Arrests.csv", header=TRUE)
# Convert the ArrestDate column
recentArrests$ArrestDate <- as.Date(recentArrests$ArrestDate, "%m/%d/%Y")
# Remove old arrests and arrests without reported locations
recentArrests <- subset(recentArrests, recentArrests$ArrestDate>="2015-01-01")
recentArrests <- subset(recentArrests, recentArrests$Location.1 != "")
# Read in the vacant buildings dataset
vacantBldgs = read.csv("Vacant_Buildings.csv", stringsAsFactors=FALSE,
header=TRUE)
# Remove rows without a location
vacantBldgs <- subset(vacantBldgs, vacantBldgs$Location.1 != "")
# Extract lat and lon into numeric columns
tmp = vacantBldgs$Location.1
tmp=gsub("\\)","",gsub("\\(","",vacantBldgs$Location.1))
tmp=strsplit(tmp, split=", ")
vacantBldgs$lon=as.numeric(sapply(tmp,function(x) x[2]))
vacantBldgs$lat=as.numeric(sapply(tmp,function(x) x[1]))
tmp = recentArrests$Location.1
tmp=gsub("\\)","",gsub("\\(","",recentArrests$Location.1))
tmp=strsplit(tmp, split=", ")
recentArrests$lon=as.numeric(sapply(tmp,function(x) x[2]))
recentArrests$lat=as.numeric(sapply(tmp,function(x) x[1]))
balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=vacantBldgs,aes(x=lon,y=lat),color="blue",alpha=.1)
plt=plt+geom_point(data=recentArrests,aes(x=lon,y=lat),color="red",alpha=.1)
print(plt)
```
##### What did you observe?
Based on the plot we produced, we observe that arrests are more concentrated
around areas with vacant buildings.
Note: because there were no dates associated with when buildings became vacant
(the data is updated twice a month), we used arrest records from the month of
January 2015 to minimize the number of "extra" vacant buildings. We assume that
there was not a significant number of new vacant buildings added since the month
of January.
#### Zach Jiroun, Jose Zamora, Des Chandhok
What question are you asking?:
What is the female to male ratio of different types of crimes on Valentine's Day, 2011?
```{r ZachJiroun}
library(plyr)
vdaytab <- table(incident=arrest_tab$incidentOffense,sex=arrest_tab$sex,date=arrest_tab$arrestDate=="02/14/2011")
vdaytab <- vdaytab[,,-1]
vdaydf = as.data.frame.matrix(vdaytab)
vdaydf$ratio = vdaydf$F/vdaydf$M
vdaydf$ratio <- round(vdaydf$ratio, 3)
vdaydf <- subset(vdaydf, M > 0 | F > 0)
vdaydf[-1]
vdaydf$offense <- c("111", "112", "115", "23", "24", "26", "3K", "3P", "4B", "4C", "4E", "54", "5D", "6C", "79", "7A", "87", "87O", "97", "Unknown")
vdaydf$offense_description <- c("(111) Protective Order", "(112) Traffic Related Incident", "(115) Trespassing", "(23) Unauthorized Use", "(24) Towed Vehicle", "(26) Recovered Vehicle", "(3K) Robb Res. (Ua)", "(3P) Robb Misc. (Ua)", "(4B) Agg. Asslt.- Cut", "(4C) Agg. Asslt.- Oth.", "(4E) Common Assault", "(54) Armed Person", "(5D) Burg. Oth. (Force)", "(6C) Larceny- Shoplifting", "(79) Other", "(7A) Stolen Auto", "(87) Narcotics", "(87O)Narcotics (Outside)", "(97) Search & Seizure", "Unknown Offense")
ggplot(data=vdaydf, aes(x=offense, y=ratio, fill=offense_description)) + geom_bar(stat="identity") + ggtitle("Valentine's Day 2011 F/M Crime Ratio") + theme(axis.text.x = element_text(angle=90))
# Removing 0 and Inf ratios
v <- subset(vdaydf, ratio > 0 & ratio < Inf)
v[-1]
v$offense <- c("4B", "4E", "79", "87O", "97", "Unknown")
v$offense_description <- c("Agg. Asslt.- Cut", "Common Assault", "Other", "Narcotics (Outside)", "Search & Seizure", "Unknown Offense")
ggplot(data=v, aes(x=offense, y=ratio, fill=offense_description)) + geom_bar(stat="identity") + ggtitle("Valentine's Day 2011 F/M Crime Ratio")
```
What did you observe? We were surprised to find that there were significantly more Male arrests than Female arrests. In the categories where Males and Females were arrested for the same crime, only about 1/4 of them were Female. We also thought it was interesting that there was only 1 Female Narcotics related arrest. Whether or not 17 Narcotics related Male arrests were related to Valentine's Day remains to be seen.
#### Klar Kuo
What are you asking?
What types of crimes have increased and decreased in frequency?
What is the code you use to answer it?
```{r cheesepuff18}
library(ggplot2)
library(stringr)
require(data.table)
# Filter the list so that it only includes types of offenses that have happened over 500 times
# (So that it has enough data to be meaningful)
arrest_type <- table(arrest_tab$incidentOffense)
arrest_type[arrest_type >= 500]
arrest_type_list <- names(arrest_type[arrest_type >= 500])
arrest_type_list
# All the arrests for those types that we singled out
arrest_dates <- arrest_tab[arrest_tab$incidentOffense %in% arrest_type_list ,]
# Set all the counts to one so we can sum them when we aggregate
arrest_table <- data.table(arrest_dates)
arrest_table <- arrest_table[, count := 1, by=arrestDate]
# Aggregate all the data together by how many times a type of incident occured on each day
arrest_table <- setNames(aggregate(arrest_table$count, list(arrest_table$incidentOffense, arrest_table$arrestDate), sum), c("incidentOffense","arrestDate","count"))
# Graph it
ggplot(data=arrest_table, aes(x=arrestDate, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
# The last graph has too many data points crammed in, so we aggregate the data by month and graph it again
# Set a shortdate column with a simple month and year so the system can aggregate them on it
arrest_table$shortdate <- strftime(as.Date(arrest_table$arrestDate, "%m/%d/%Y"), format="%y/%m")
arrest_table <- setNames(aggregate(arrest_table$count, list(arrest_table$incidentOffense, arrest_table$shortdate), sum), c("incidentOffense","arrestMonth","count"))
ggplot(data=arrest_table, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
# Let's take a look at the graph without the two highest types of occurrences, since they dwarf the rest
low_arrests <- arrest_table[arrest_table$incidentOffense!="87-Narcotics",]
low_arrests <- low_arrests[low_arrests$incidentOffense!="Unknown Offense",]
ggplot(data=low_arrests, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
# Take a closer look at the most common type of occurrence that's not "Unknown"
# (87-Narcotics)
high_arrests <- arrest_table[arrest_table$incidentOffense=="87-Narcotics",]
high_arrests
ggplot(data=high_arrests, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
# Now lets split this into two lines, one for each year
high_arrests$year <- str_split_fixed(high_arrests$arrestMonth, "/", 2)[,1]
high_arrests$month <- str_split_fixed(high_arrests$arrestMonth, "/", 2)[,2]
ggplot(data=high_arrests, aes(x=month, y=count, group=year, colour=year)) +
geom_line() +
geom_point()
```
What did you observe?
Between the beginning of 2011 and the end of 2012, crime rates fluctuated quite a bit but ended up at around the same level. More specifically, the rates for the most common types of crimes all seemed to increase between the two dates, but then dropped back to around where they started.
After looking at graphs for each year specifically, it looks like crime rises towards the middle of the year and then dies down towards winter. This could be due to the weather or perhaps the holiday scheduling of the police (or of criminal activity).
However, it does look like the rate of crime for this time period increased right in the middle. In the year-separated graph for Narcotics, the end of 2011 had clearly higher levels than the start of 2011, and the start of 2012 was higher than the end of 2012. This could be attributed to lower levels of employment in that period.
#### Albert Koy
##### What question are you asking?
Do people commit fewer crimes during the holidays?
##### What is the code you use to answer it?
```{r Albert}
(function() {
source("calendarHeat_modified.R")
arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
arrest_counts_tab <- table(arrest_tab$arrestDate)
arrest_counts_df <- data.frame(date=as.Date(names(arrest_counts_tab), "%m/%d/%Y"), num_arrests=as.vector(arrest_counts_tab))
calendarHeat(arrest_counts_df$date, arrest_counts_df$num_arrests, color="g2r", varname="Number of Arrests Per Day")
})()
```
##### What did you observe?
Looking at the heat map for 2011 and 2012, there is a very distinct reduction in number of arrests on Thanksgiving (the fourth Thursday of November). Moreover, on Christmas (the last Sunday of December in 2011 and the last Tuesday of December in 2012) and the days leading up to Christmas, we observe green squares as well. It's hard to tell whether this reduction in arrests is due to fewer officers working on the holidays or due to fewer crimes being committed. In making this judgement, it would be useful to have data on the number of officers working each day.