forked from DataTalksClub/machine-learning-zoomcamp

Commit 9cf72a8 (parent f2f9c41): 55 changed files with 1,941 additions and 1,839 deletions.
## Homework

> **Solution**: [homework-1.ipynb](homework-1.ipynb).

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from [06-environment.md](06-environment.md).
### Question 1

What's the version of NumPy that you installed?

You can get the version information using the `__version__` field:

```python
np.__version__
```
### Question 2

What's the version of Pandas?
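As in Question 1, the Pandas version also lives in the `__version__` field. A minimal sketch, assuming both libraries are installed:

```python
import numpy as np
import pandas as pd

# Each library exposes its version string in the __version__ field
print(np.__version__)
print(pd.__version__)
```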
### Getting the data

For this homework, we'll use the same dataset as for the next session - the car price dataset.

Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.
### Question 3

What's the average price of BMW cars in the dataset?
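One possible approach, sketched on a toy frame (the dataset's manufacturer and price columns are `Make` and `MSRP`; the values below are made up):

```python
import pandas as pd

# Toy stand-in for the car price dataset
df = pd.DataFrame({
    'Make': ['BMW', 'BMW', 'Audi'],
    'MSRP': [40000, 36000, 30000],
})

# Filter to BMW rows, then average the price column
avg_bmw = df[df['Make'] == 'BMW']['MSRP'].mean()
print(avg_bmw)  # 38000.0 on this toy data
```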
### Question 4

Select a subset of cars after year 2015 (inclusive, i.e. 2015 and after). How many of them have missing values for Engine HP?
### Question 5

* Calculate the average "Engine HP" in the dataset.
* Use the `fillna` method to fill the missing values in "Engine HP" with the mean value from the previous step.
* Now, calculate the average of "Engine HP" again.
* Has it changed?

Round both means before answering this question. You can use the `round` function for that:

```python
print(round(mean_hp_before))
print(round(mean_hp_after))
```
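The steps above can be sketched on a toy series. Note that filling gaps with the mean leaves the mean unchanged mathematically, so after rounding the two values typically coincide:

```python
import numpy as np
import pandas as pd

hp = pd.Series([200.0, np.nan, 300.0, np.nan], name='Engine HP')

mean_hp_before = hp.mean()              # NaNs are skipped by mean()
hp_filled = hp.fillna(mean_hp_before)   # fill the gaps with that mean
mean_hp_after = hp_filled.mean()

print(round(mean_hp_before), round(mean_hp_after))  # 250 250
```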
### Question 6

* Select all the "Rolls-Royce" cars from the dataset.
* Select only columns "Engine HP", "Engine Cylinders", "highway MPG".
* Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 7 rows).
* Get the underlying NumPy array. Let's call it `X`.
* Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
* Invert `XTX`.
* What's the sum of all the elements of the result?

Hint: if the result is negative, re-read the task one more time.
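The matrix steps can be sketched with a synthetic 7x3 `X` (the real one comes from the deduplicated Rolls-Royce rows):

```python
import numpy as np

# Synthetic stand-in for the 7x3 Rolls-Royce feature matrix
X = np.array([
    [300., 8., 20.],
    [320., 8., 21.],
    [340., 12., 19.],
    [360., 12., 18.],
    [400., 12., 17.],
    [450., 12., 16.],
    [500., 16., 15.],
])

XTX = X.T @ X                   # 3x3 matrix-matrix product
XTX_inv = np.linalg.inv(XTX)    # invert it
print(XTX_inv.sum())            # sum of all elements
```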
### Question 7

* Create an array `y` with values `[1000, 1100, 900, 1200, 1000, 850, 1300]`.
* Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
* What's the value of the first element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.
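With the same synthetic `X` as in the Question 6 sketch, the computation chains together as the normal equation of linear regression:

```python
import numpy as np

# Synthetic stand-in for the 7x3 Rolls-Royce feature matrix
X = np.array([
    [300., 8., 20.],
    [320., 8., 21.],
    [340., 12., 19.],
    [360., 12., 18.],
    [400., 12., 17.],
    [450., 12., 16.],
    [500., 16., 15.],
])
y = np.array([1000, 1100, 900, 1200, 1000, 850, 1300])

XTX = X.T @ X
w = np.linalg.inv(XTX) @ X.T @ y   # w = (X^T X)^-1 X^T y
print(w[0])                        # first element of w
```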
## Submit the results

Submit your results here: https://forms.gle/aiunQqRtqcay8Wwo9.

If your answer doesn't match the options exactly, select the closest one.

## Deadline

The deadline for submitting is 13 September 2021, 17:00 CET. After that, the form will be closed.
## Navigation

* [Machine Learning Zoomcamp course](../)
* [Lesson 1: Introduction to Machine Learning](./)
* Previous: [Summary](10-summary.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)
---
## Homework

Solution: [homework.ipynb](homework.ipynb)
### Dataset

In this homework, we will use the New York City Airbnb Open Data. You can take it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

The goal of this homework is to create a regression model for predicting apartment prices (column `'price'`).
### EDA

* Load the data.
* Look at the `price` variable. Does it have a long tail?
### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only them.
### Question 1

Find a feature with missing values. How many missing values does it have?
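Counting missing values per column can be sketched as follows (toy frame; in the real dataset the column with NaNs among the selected features is `reviews_per_month`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'minimum_nights': [1, 2, 3, 4],
    'reviews_per_month': [0.5, np.nan, 1.2, np.nan],
})

# One NaN count per column
missing = df.isnull().sum()
print(missing)
```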
### Question 2

What's the median (50% percentile) for the variable 'minimum_nights'?
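On a series, the median is a single call (toy numbers for illustration):

```python
import pandas as pd

minimum_nights = pd.Series([1, 2, 2, 3, 30])
print(minimum_nights.median())  # 2.0
```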
### Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Make sure that the target value ('price') is not in your dataframe.
* Apply the log transformation to the price variable using the `np.log1p()` function.
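The steps above can be sketched lesson-style with a manual NumPy shuffle (the toy frame stands in for the real one):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'feature': range(10), 'price': range(100, 110)})

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

# Log-transform the target, then remove it from the feature frames
y_train = np.log1p(df_train.pop('price').values)
y_val = np.log1p(df_val.pop('price').values)
y_test = np.log1p(df_test.pop('price').values)

print(len(df_train), len(df_val), len(df_test))  # 6 2 2
```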
### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?
### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0.
* Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.
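A sketch of the loop, using a lesson-style regularized solver (adding `r` to the diagonal of `X^T X`) on synthetic data; the data and scores are illustrative only:

```python
import numpy as np

def train_linear_regression_reg(X, y, r):
    # Lesson-style ridge: add r to the diagonal of X^T X before inverting
    ones = np.ones(X.shape[0])
    Xb = np.column_stack([ones, X])
    XTX = Xb.T @ Xb + r * np.eye(Xb.shape[1])
    w = np.linalg.inv(XTX) @ Xb.T @ y
    return w[0], w[1:]

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
y_train = 1 + X_train @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)
X_val = rng.normal(size=(20, 3))
y_val = 1 + X_val @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=20)

best_r, best_score = None, float('inf')
for r in [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]:
    w0, w = train_linear_regression_reg(X_train, y_train, r)
    score = round(rmse(y_val, w0 + X_val @ w), 2)
    if score < best_score:          # strict '<' keeps the smallest r on ties
        best_r, best_score = r, score
    print(r, score)
print('best:', best_r, best_score)
```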
### Question 5

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)
> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different.
> If the standard deviation of scores is low, then our model is *stable*.
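Once the scores are collected in a list, the final step is a single call (the numbers below are made up for illustration):

```python
import numpy as np

# Suppose these are the validation RMSEs collected over the 10 seeds
scores = [0.65, 0.64, 0.66, 0.63, 0.65, 0.64, 0.66, 0.65, 0.64, 0.65]

std = np.std(scores)
print(round(std, 3))  # 0.009
```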
### Question 6

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`.
* What's the RMSE on the test dataset?
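Combining the train and validation parts can be sketched as (toy frames; after this you retrain on the combined set):

```python
import numpy as np
import pandas as pd

df_train = pd.DataFrame({'f': [1, 2, 3]})
df_val = pd.DataFrame({'f': [4, 5]})
y_train = np.array([1.0, 2.0, 3.0])
y_val = np.array([4.0, 5.0])

# Stack the feature frames and targets in the same order
df_full_train = pd.concat([df_train, df_val]).reset_index(drop=True)
y_full_train = np.concatenate([y_train, y_val])
print(len(df_full_train), len(y_full_train))  # 5 5
```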
## Submit the results

Submit your results here: https://forms.gle/2N9GkTr1AgNeZ8hD7.

If your answer doesn't match the options exactly, select the closest one.

## Deadline

The deadline for submitting is 20 September 2021, 17:00 CET. After that, the form will be closed.
## Navigation

* [Machine Learning Zoomcamp course](../)
* [Session 2: Machine Learning for Regression](./)
* Previous: [Explore more](17-explore-more.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)
---
## Homework
### Dataset

In this homework, we will continue working with the New York City Airbnb Open Data. You can take it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

We'll keep working with the `'price'` variable, and we'll transform it into a classification task.
### Features

For the rest of the homework, you'll need to use the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. So the full feature set is:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only them and fill in the missing values with 0.
### Question 1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?
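Pandas computes the mode directly (toy values for illustration):

```python
import pandas as pd

neighbourhood_group = pd.Series(
    ['Manhattan', 'Brooklyn', 'Manhattan', 'Queens', 'Manhattan'])

# mode() returns a Series (there can be ties); take the first entry
print(neighbourhood_group.mode()[0])  # Manhattan
```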
### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value ('price') is not in your dataframe.
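Since `train_test_split` only makes one cut, a 60/20/20 split takes two calls; a sketch on a toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'f': range(10), 'price': range(10)})

# First carve off 20% for test, then 25% of the remaining 80% for val
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
print(len(df_train), len(df_val), len(df_test))  # 6 2 2
```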
### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
* In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

Example of a correlation matrix for the car price dataset:

<img src="images/correlation-matrix.png" />
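One way to find the strongest pair programmatically, sketched on synthetic data (the diagonal is masked out because every feature correlates perfectly with itself):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    'x': a,
    'y': a + rng.normal(scale=0.1, size=100),  # strongly correlated with x
    'z': rng.normal(size=100),
})

corr = df.corr().abs()
# Mask the diagonal, then take the pair with the largest coefficient
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
pair = off_diag.stack().idxmax()
print(pair)
```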
### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`.
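The binarization is a one-liner (toy prices):

```python
import pandas as pd

price = pd.Series([50, 152, 300, 100])

# 1 where price >= 152, else 0
above_average = (price >= 152).astype(int)
print(above_average.tolist())  # [0, 1, 1, 0]
```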
### Question 3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has the bigger score?
* Round it to 2 decimal digits using `round(score, 2)`
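A sketch with Scikit-Learn's `mutual_info_score` on toy labels, where the category fully determines the binarized price:

```python
from sklearn.metrics import mutual_info_score

# Toy labels: room_type here perfectly determines above_average
room_type = ['Entire home', 'Private room', 'Entire home', 'Private room']
above_average = [1, 0, 1, 0]

score = mutual_info_score(above_average, room_type)
print(round(score, 2))  # 0.69 (= ln 2 nats, since the variables are fully dependent)
```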
### Question 4

* Now let's train a logistic regression.
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
* To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
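One-hot encoding plus the specified model can be sketched with `DictVectorizer`, as in the lessons (toy records mixing a categorical and a numerical feature):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train_dicts = [
    {'room_type': 'Entire home', 'minimum_nights': 1},
    {'room_type': 'Private room', 'minimum_nights': 3},
    {'room_type': 'Entire home', 'minimum_nights': 2},
    {'room_type': 'Private room', 'minimum_nights': 5},
]
y_train = [1, 0, 1, 0]

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)  # one-hot encodes the string values

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)

val_dicts = [{'room_type': 'Entire home', 'minimum_nights': 2}]
X_val = dv.transform(val_dicts)          # reuse the fitted vectorizer
print(model.predict(X_val))
```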
### Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
  * `neighbourhood_group`
  * `room_type`
  * `number_of_reviews`
  * `reviews_per_month`

> **Note**: the difference doesn't have to be positive.
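The elimination loop can be sketched on synthetic data (here `f3` is useless by construction, so its difference should be near zero):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
features = ['f1', 'f2', 'f3']
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)  # f3 carries no signal

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X, y)
acc_full = (model.predict(X) == y).mean()

for i, name in enumerate(features):
    X_small = np.delete(X, i, axis=1)   # drop one feature, keep the rest
    model.fit(X_small, y)
    acc = (model.predict(X_small) == y).mean()
    print(name, round(acc_full - acc, 3))
```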
### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.
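The alpha search can be sketched on synthetic data (the target stands in for the already log-transformed price; scores are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))
y_train = X_train @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=80)
X_val = rng.normal(size=(30, 3))
y_val = X_val @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=30)

best_alpha, best_rmse = None, float('inf')
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    score = round(np.sqrt(mean_squared_error(y_val, model.predict(X_val))), 3)
    if score < best_rmse:           # strict '<' keeps the smallest alpha on ties
        best_alpha, best_rmse = alpha, score
    print(alpha, score)
print('best:', best_alpha)
```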
## Submit the results

Submit your results here: https://forms.gle/xGpZhoq9Efm9E4RA9

It's possible that your answers won't match exactly. If that's the case, select the closest one.

## Deadline

The deadline for submitting is 27 September 2021, 17:00 CET. After that, the form will be closed.
## Navigation

* [Machine Learning Zoomcamp course](../)
* [Session 3: Machine Learning for Classification](./)
* Previous: [Explore more](14-explore-more.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)