# coding: utf-8
# # Capstone 2: Biodiversity Project
# # Introduction
# You are a biodiversity analyst working for the National Parks Service. You're going to help them analyze some data about species at various national parks.
#
# Note: The data that you'll be working with for this project is *inspired* by real data, but is mostly fictional.
# # Step 1
# Import the modules that you'll be using in this assignment:
# - `from matplotlib import pyplot as plt`
# - `import pandas as pd`
# In[243]:
from matplotlib import pyplot as plt
import pandas as pd
# # Step 2
# You have been given two CSV files. The first, `species_info.csv`, contains data about different species in our National Parks, including:
# - The scientific name of each species
# - The common names of each species
# - The species conservation status
#
# Load the dataset and inspect it:
# - Load `species_info.csv` into a DataFrame called `species`
# In[244]:
species = pd.read_csv('species_info.csv')
# Inspect the DataFrame using `.head()`.
# In[245]:
print(species.head())
# # Step 3
# Let's start by learning a bit more about our data. Answer each of the following questions.
# How many different species are in the `species` DataFrame?
# In[246]:
print("number of species : " + str(species.scientific_name.nunique()))
print("number of rows : " + str(species.scientific_name.count()))
### The second value is the number of rows in the CSV (5824). Rows that share a scientific_name are treated here as the same species, so the first value is the species count.
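# The difference between `count()` and `nunique()` can be seen on a small made-up table. This is only a sketch: the rows in `toy_species` are hypothetical and are not taken from `species_info.csv`.

```python
import pandas as pd

# Hypothetical toy data: one scientific name appears on two rows.
toy_species = pd.DataFrame({
    "scientific_name": ["Ovis aries", "Ovis canadensis", "Ovis canadensis"],
    "common_names": ["Domestic Sheep", "Bighorn Sheep", "Rocky Mountain Bighorn Sheep"],
})

n_rows = toy_species.scientific_name.count()      # non-null entries, duplicates included
n_unique = toy_species.scientific_name.nunique()  # distinct values only
print(n_rows, n_unique)
```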
# What are the different values of `category` in `species`?
# In[247]:
print (species.category.unique())
# What are the different values of `conservation_status`?
# In[248]:
print (species.conservation_status.unique())
# # Step 4
# Let's start doing some analysis!
#
# The column `conservation_status` has several possible values:
# - `Species of Concern`: declining or appear to be in need of conservation
# - `Threatened`: vulnerable to endangerment in the near future
# - `Endangered`: seriously at risk of extinction
# - `In Recovery`: formerly `Endangered`, but currently not in danger of extinction throughout all or a significant portion of its range
#
# We'd like to count up how many species meet each of these criteria. Use `groupby` to count how many `scientific_name` meet each of these criteria.
# In[249]:
print (species.groupby("conservation_status").scientific_name.nunique())
# As we saw before, there are far more than 200 species in the `species` table. Clearly, only a small number of them are categorized as needing some sort of protection. The rest have `conservation_status` equal to `None`. Because `groupby` does not include `None`, we will need to fill in the null values. We can do this using `.fillna`. We pass in however we want to fill in our `None` values as an argument.
#
# Paste the following code and run it to replace `None` with `No Intervention`:
# ```python
# species.fillna('No Intervention', inplace=True)
# ```
# In[250]:
species.fillna('No Intervention', inplace=True)
# Great! Now run the same `groupby` as before to see how many species require `No Intervention`.
# In[251]:
print (species.groupby("conservation_status").scientific_name.count())
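# The effect of filling the missing values before grouping can be checked on a toy frame. This is a minimal sketch with hypothetical rows. It fills only the `conservation_status` column, which is a narrower variation on the whole-frame `fillna` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four species have no conservation status.
toy = pd.DataFrame({
    "scientific_name": ["a", "b", "c", "d"],
    "conservation_status": ["Endangered", np.nan, np.nan, "Threatened"],
})

# By default, groupby silently drops rows whose key is missing.
before = toy.groupby("conservation_status").scientific_name.nunique()

# Fill the missing statuses in this one column, then group again.
filled = toy.assign(
    conservation_status=toy.conservation_status.fillna("No Intervention")
)
after = filled.groupby("conservation_status").scientific_name.nunique()

print(before)
print(after)
```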
# Let's use `plt.bar` to create a bar chart. First, let's sort the rows by how many species are in each category. We can do this using `.sort_values`. We use the keyword `by` to indicate which column we want to sort by.
#
# Paste the following code and run it to create a new DataFrame called `protection_counts`, which is sorted by `scientific_name`:
# ```python
# protection_counts = species.groupby('conservation_status')\
# .scientific_name.count().reset_index()\
# .sort_values(by='scientific_name')
# ```
# In[252]:
protection_counts = (
    species.groupby('conservation_status')
    .scientific_name.count()
    .reset_index()
    .sort_values(by='scientific_name')
)
# Now let's create a bar chart!
# 1. Start by creating a wide figure with `figsize=(10, 4)`
# 2. Create an axes object called `ax` using `plt.subplot`.
# 3. Create a bar chart whose heights are equal to the `scientific_name` column of `protection_counts`.
# 4. Create an x-tick for each of the bars.
# 5. Label each x-tick with the label from `conservation_status` in `protection_counts`
# 6. Label the y-axis `Number of Species`
# 7. Title the graph `Conservation Status by Species`
# 8. Plot the graph using `plt.show()`
# In[253]:
plt.figure(figsize=(10,4))
# Create the axes first so the bars and the tick labels share the same axes object
ax = plt.subplot()
plt.bar(range(len(protection_counts.scientific_name)), protection_counts.scientific_name)
ax.set_xticks(range(len(protection_counts.scientific_name)))
ax.set_xticklabels(protection_counts.conservation_status)
plt.ylabel("Number of Species")
plt.title("Conservation Status by Species")
plt.show()
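# The same chart can also be drawn through the axes object alone, which avoids mixing `plt.` and `ax.` calls. This is a sketch under assumptions: the counts in `toy_counts` are hypothetical, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical counts, already sorted from smallest to largest.
toy_counts = pd.DataFrame({
    "conservation_status": ["In Recovery", "Threatened", "Endangered"],
    "scientific_name": [4, 10, 15],
})

fig, ax = plt.subplots(figsize=(10, 4))
positions = list(range(len(toy_counts)))
ax.bar(positions, toy_counts.scientific_name)       # one bar per status
ax.set_xticks(positions)                            # one tick per bar
ax.set_xticklabels(toy_counts.conservation_status)  # label ticks with the status
ax.set_ylabel("Number of Species")
ax.set_title("Conservation Status by Species")
```

# In an interactive session, `plt.show()` would display the figure.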
# # Step 4 (continued)
# Are certain types of species more likely to be endangered?
# Let's create a new column in `species` called `is_protected`, which is `True` if `conservation_status` is not equal to `No Intervention`, and `False` otherwise.
# In[254]:
species["is_protected"]=species.conservation_status != "No Intervention"
### alternative : species["is_protected"]=species.conservation_status.apply(lambda a: False if a=="No Intervention" else True)
# Let's group by *both* `category` and `is_protected`. Save your results to `category_counts`.
# In[255]:
category_counts = species.groupby(["category","is_protected"]).scientific_name.nunique().reset_index()
# Examine `category_counts` using `head()`.
# In[256]:
print (category_counts.head())
# It's going to be easier to view this data if we pivot it. Using `pivot`, rearrange `category_counts` so that:
# - `columns` is `is_protected`
# - `index` is `category`
# - `values` is `scientific_name`
#
# Save your pivoted data to `category_pivot`. Remember to `reset_index()` at the end.
# In[257]:
category_pivot = category_counts.pivot(index="category",columns="is_protected",values = "scientific_name").reset_index()
# Examine `category_pivot`.
# In[258]:
category_pivot
# Use the `.columns` property to rename the categories `True` and `False` to something more descriptive:
# - Leave `category` as `category`
# - Rename `False` to `not_protected`
# - Rename `True` to `protected`
# In[259]:
category_pivot.columns = ['category','not_protected','protected']
# Let's create a new column of `category_pivot` called `percent_protected`, which is equal to `protected` (the number of species that are protected) divided by `protected` plus `not_protected` (the total number of species).
# In[260]:
category_pivot['percent_protected'] = category_pivot.protected / (category_pivot.protected + category_pivot.not_protected)
# Examine `category_pivot`.
# In[261]:
category_pivot
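# The pivot, rename, and percentage steps can be traced on a two-category example. This is a sketch: `toy_counts` is built by hand, using the `Mammal` and `Bird` counts that appear in the contingency table later in this notebook.

```python
import pandas as pd

# Long format: one row per (category, is_protected) pair.
toy_counts = pd.DataFrame({
    "category": ["Bird", "Bird", "Mammal", "Mammal"],
    "is_protected": [False, True, False, True],
    "scientific_name": [413, 75, 146, 30],
})

# Wide format: one row per category, one column per is_protected value.
toy_pivot = (
    toy_counts
    .pivot(index="category", columns="is_protected", values="scientific_name")
    .reset_index()
)

# The pivoted columns come out as category, False, True (in that order).
toy_pivot.columns = ["category", "not_protected", "protected"]
toy_pivot["percent_protected"] = toy_pivot.protected / (
    toy_pivot.protected + toy_pivot.not_protected
)
print(toy_pivot)
```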
# It looks like species in category `Mammal` are more likely to be endangered than species in `Bird`. We're going to do a significance test to see if this statement is true. Before you do the significance test, consider the following questions:
# - Is the data numerical or categorical?
# - How many pieces of data are you comparing?
# Based on those answers, you should choose to do a *chi squared test*. In order to run a chi squared test, we'll need to create a contingency table. Our contingency table should look like this:
#
# ||protected|not protected|
# |-|-|-|
# |Mammal|?|?|
# |Bird|?|?|
#
# Create a table called `contingency` and fill it in with the correct numbers
# In[262]:
contingency = [[30, 146],[75,413]]
contingency
# In order to perform our chi square test, we'll need to import the correct function from scipy. Paste the following code and run it:
# ```py
# from scipy.stats import chi2_contingency
# ```
# In[263]:
from scipy.stats import chi2_contingency
# Now run `chi2_contingency` with `contingency`.
# In[264]:
_, pval, _, _ = chi2_contingency(contingency)
print(pval)
# It looks like this difference isn't significant!
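# To make the conclusion explicit, the p-value can be compared against a threshold. This is a sketch: the threshold `alpha = 0.05` is an assumption, since the notebook does not state which level it uses. The counts are the `Mammal` and `Bird` counts from the table above.

```python
from scipy.stats import chi2_contingency

# Rows: Mammal, Bird. Columns: protected, not protected.
mammal_vs_bird = [[30, 146], [75, 413]]

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the expected counts under independence.
chi2, pval, dof, expected = chi2_contingency(mammal_vs_bird)

alpha = 0.05  # assumed significance threshold
significant = pval < alpha
print(f"chi2={chi2:.3f}, dof={dof}, p={pval:.3f}, significant={significant}")
```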
#
# Let's test another. Is the difference between `Reptile` and `Mammal` significant?
# In[285]:
contingency2 = [[30, 146], [5, 73]]
_, pval2, _, _ = chi2_contingency(contingency2)
print("mammal-reptile:")
print(pval2)
contingency3 = [[30, 146], [7, 72]]
_, pval3, _, _ = chi2_contingency(contingency3)
print("mammal-amphibian:")
print(pval3)
contingency4 = [[30, 146], [11, 115]]
_, pval4, _, _ = chi2_contingency(contingency4)
print("mammal-fish:")
print(pval4)
# Yes! It looks like there is a significant difference between `Reptile` and `Mammal`!
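# The three comparisons above repeat the same pattern, so they could be run in a loop instead. This is a sketch that reuses the counts from the cell above. The names `mammal`, `others` and `pvals` are introduced here for illustration only.

```python
from scipy.stats import chi2_contingency

# Each list holds [protected, not protected] counts.
mammal = [30, 146]
others = {
    "reptile": [5, 73],
    "amphibian": [7, 72],
    "fish": [11, 115],
}

# Index 1 of the result is the p-value of the 2x2 test against Mammal.
pvals = {
    name: chi2_contingency([mammal, counts])[1]
    for name, counts in others.items()
}

for name, p in pvals.items():
    print(f"mammal-{name}: {p:.4f}")
```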
# # Step 5
# Conservationists have been recording sightings of different species at several national parks for the past 7 days. They've sent you their observations in a file called `observations.csv`. Load `observations.csv` into a variable called `observations`, then use `head` to view the data.
# In[266]:
observations = pd.read_csv('observations.csv')
observations.head()
# Some scientists are studying the number of sheep sightings at different national parks. There are several different scientific names for different types of sheep. We'd like to know which rows of `species` are referring to sheep. Notice that the following code will tell us whether or not a word occurs in a string:
# In[267]:
# Does "Sheep" occur in this string?
str1 = 'This string contains Sheep'
'Sheep' in str1
# In[268]:
# Does "Sheep" occur in this string?
str2 = 'This string contains Cows'
'Sheep' in str2
# Use `apply` and a `lambda` function to create a new column in `species` called `is_sheep` which is `True` if the `common_names` contains `'Sheep'`, and `False` otherwise.
# In[269]:
species['is_sheep'] = species.common_names.apply(lambda x: 'Sheep' in x)
# Select the rows of `species` where `is_sheep` is `True` and examine the results.
# In[270]:
species.loc[species.is_sheep==True]
# In[ ]:
# Many of the results are actually plants. Select the rows of `species` where `is_sheep` is `True` and `category` is `Mammal`. Save the results to the variable `sheep_species`.
# In[271]:
sheep_species = species.loc[(species.is_sheep==True) & (species.category=='Mammal')]
sheep_species
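# The substring test and the category filter can be checked on a toy frame. This is a sketch with hypothetical rows. `Series.str.contains` is shown as a vectorized alternative to `apply`, and is not what the notebook itself uses.

```python
import pandas as pd

# Hypothetical rows: a plant whose common name also contains "Sheep".
toy_species = pd.DataFrame({
    "common_names": ["Domestic Sheep", "Sheep Sorrel", "Bighorn Sheep", "Gray Wolf"],
    "category": ["Mammal", "Vascular Plant", "Mammal", "Mammal"],
})

# Row-by-row test, as in the notebook.
with_apply = toy_species.common_names.apply(lambda names: "Sheep" in names)

# Vectorized equivalent; regex=False requests a plain substring match.
with_str = toy_species.common_names.str.contains("Sheep", regex=False)

# Combine two boolean masks with & to keep only the mammals.
sheep_mammals = toy_species[with_str & (toy_species.category == "Mammal")]
print(sheep_mammals)
```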
# Now merge `sheep_species` with `observations` to get a DataFrame with observations of sheep. Save this DataFrame as `sheep_observations`.
# In[272]:
sheep_observations = pd.merge(sheep_species, observations, how = 'left')
sheep_observations
# How many total sheep observations (across all three species) were made at each national park? Use `groupby` to get the `sum` of `observations` for each `park_name`. Save your answer to `obs_by_park`.
#
# This is the total number of sheep observed in each park over the past 7 days.
# In[273]:
obs_by_park = sheep_observations.groupby('park_name').observations.sum().reset_index()
print(obs_by_park)
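# The merge and the per-park sum can be followed on a small example. This is a sketch: the park names and counts are hypothetical, and the join column is passed explicitly with `on="scientific_name"`, whereas the notebook lets pandas infer the shared column.

```python
import pandas as pd

# Hypothetical sheep species (left table).
toy_sheep = pd.DataFrame({
    "scientific_name": ["Ovis aries", "Ovis canadensis"],
    "common_names": ["Domestic Sheep", "Bighorn Sheep"],
})

# Hypothetical sightings (right table), including one non-sheep species.
toy_observations = pd.DataFrame({
    "scientific_name": ["Ovis aries", "Ovis aries", "Ovis canadensis", "Canis lupus"],
    "park_name": ["Park A", "Park B", "Park A", "Park A"],
    "observations": [10, 20, 5, 7],
})

# A left merge keeps every sheep species and attaches its sightings;
# the wolf row has no match on the left, so it is dropped.
toy_merged = pd.merge(toy_sheep, toy_observations, on="scientific_name", how="left")

# Total sheep sightings per park.
toy_by_park = toy_merged.groupby("park_name").observations.sum().reset_index()
print(toy_by_park)
```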
# Create a bar chart showing the number of observations per week at each park.
#
# 1. Start by creating a wide figure with `figsize=(16, 4)`
# 2. Create an axes object called `ax` using `plt.subplot`.
# 3. Create a bar chart whose heights are equal to the `observations` column of `obs_by_park`.
# 4. Create an x-tick for each of the bars.
# 5. Label each x-tick with the label from `park_name` in `obs_by_park`
# 6. Label the y-axis `Number of Observations`
# 7. Title the graph `Observations of Sheep per Week`
# 8. Plot the graph using `plt.show()`
# In[274]:
plt.figure(figsize=(16,4))
# Create the axes first so the bars and the tick labels share the same axes object
ax = plt.subplot()
plt.bar(range(len(obs_by_park.observations)), obs_by_park.observations)
ax.set_xticks(range(len(obs_by_park.park_name)))
ax.set_xticklabels(obs_by_park.park_name)
plt.ylabel("Number of Observations")
plt.title("Observations of Sheep per Week")
plt.show()
# Our scientists know that 15% of sheep at Bryce National Park have foot and mouth disease. Park rangers at Yellowstone National Park have been running a program to reduce the rate of foot and mouth disease at that park. The scientists want to test whether or not this program is working. They want to be able to detect reductions of at least 5 percentage points. For instance, if 10% of sheep in Yellowstone have foot and mouth disease, they'd like to be able to know this, with confidence.
#
# Use <a href="https://s3.amazonaws.com/codecademy-content/courses/learn-hypothesis-testing/a_b_sample_size/index.html">Codecademy's sample size calculator</a> to calculate the number of sheep that they would need to observe from each park. Use the default level of significance (90%).
#
# Remember that "Minimum Detectable Effect" is a percent of the baseline.
# In[275]:
### Baseline conversion rate = 15%
### Statistical significance = 90%
### Minimum detectable effect = 5/15 = 33%
sample_size = 890.0 ### according to calculator
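# The minimum detectable effect above is a relative change: the reduction to detect divided by the baseline rate. A short sketch of that arithmetic, using the 15% baseline and the 10% target from the text (the variable names are illustrative):

```python
baseline_rate = 15.0  # percent of sheep with the disease at Bryce
target_rate = 10.0    # rate the scientists want to be able to detect

# Relative change, expressed as a percent of the baseline.
minimum_detectable_effect = 100 * (baseline_rate - target_rate) / baseline_rate
print(round(minimum_detectable_effect, 1))
```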
# How many weeks would you need to observe sheep at Bryce National Park in order to observe enough sheep? How many weeks would you need to observe at Yellowstone National Park to observe enough sheep?
# In[276]:
obs_by_park['week'] = sample_size / obs_by_park['observations']
print (obs_by_park)
# In[277]:
weeks_Bryce = 4
weeks_Yellowstone = 2
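# The week counts can also be derived by rounding up rather than read off the table by eye, since a partial week still has to be observed in full. This is a sketch: the park labels and weekly sighting counts below are hypothetical placeholders, and only `sample_size` comes from the notebook.

```python
import math

sample_size = 890  # per-park sample size from the calculator step above

# Hypothetical weekly sheep sightings for two parks.
weekly_sightings = {"Park A": 250, "Park B": 507}

# Round up: 3.56 weeks of observation means 4 full weeks.
weeks_needed = {
    park: math.ceil(sample_size / sightings)
    for park, sightings in weekly_sightings.items()
}
print(weeks_needed)
```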
# In[ ]: