Commit

preparing for 2022 cohort
alexeygrigorev committed Aug 29, 2022
1 parent f2f9c41 commit 9cf72a8
Showing 55 changed files with 1,941 additions and 1,839 deletions.
106 changes: 3 additions & 103 deletions course-zoomcamp/01-intro/homework.md
@@ -1,104 +1,4 @@
## Session #1 Homework
## Homework

> **Solution**: [homework-1.ipynb](homework-1.ipynb).
### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from [06-environment.md](06-environment.md).

### Question 1

What's the version of NumPy that you installed?

You can get the version information using the `__version__` field:

```python
np.__version__
```

### Question 2

What's the version of Pandas?
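As with NumPy, Pandas exposes its version string through the `__version__` field:

```python
import pandas as pd

# Pandas reports its version the same way NumPy does
print(pd.__version__)
```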


### Getting the data

For this homework, we'll use the same dataset as for the next session - the car price dataset.

Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.


### Question 3

What's the average price of BMW cars in the dataset?
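One way to approach this is to filter the rows by make and take the mean of the price column. The sketch below uses a tiny toy frame standing in for the real data, and assumes the dataset's column names `'Make'` and `'MSRP'`:

```python
import pandas as pd

# Toy stand-in for the car price dataset; in the real data the make
# is in the 'Make' column and the price is in the 'MSRP' column
df = pd.DataFrame({
    'Make': ['BMW', 'BMW', 'Audi'],
    'MSRP': [40000, 50000, 30000],
})

mean_bmw = df[df['Make'] == 'BMW']['MSRP'].mean()
print(mean_bmw)  # 45000.0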


### Question 4

Select a subset of cars after year 2015 (inclusive, i.e. 2015 and after). How many of them have missing values for Engine HP?
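A sketch of the filter-then-count pattern, on toy data and assuming the dataset's `'Year'` and `'Engine HP'` column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in; the real dataset has 'Year' and 'Engine HP' columns
df = pd.DataFrame({
    'Year': [2014, 2015, 2016, 2017],
    'Engine HP': [200.0, np.nan, 300.0, np.nan],
})

after_2015 = df[df['Year'] >= 2015]
missing_hp = after_2015['Engine HP'].isnull().sum()
print(missing_hp)  # 2
```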


### Question 5

* Calculate the average "Engine HP" in the dataset.
* Use the `fillna` method to fill the missing values in "Engine HP" with the mean value from the previous step.
* Now, calculate the average of "Engine HP" again.
* Has it changed?

Round both means before answering this question. You can use the `round` function for that:

```python
print(round(mean_hp_before))
print(round(mean_hp_after))
```
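On a toy frame, the full sequence of steps looks like this (note `Series.mean()` skips NaN values by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Engine HP': [100.0, np.nan, 300.0]})

mean_hp_before = df['Engine HP'].mean()  # NaN values are skipped
df['Engine HP'] = df['Engine HP'].fillna(mean_hp_before)
mean_hp_after = df['Engine HP'].mean()

print(round(mean_hp_before))  # 200
print(round(mean_hp_after))   # 200
```

On this toy data the mean is unchanged by construction; run the same steps on the real column to answer the question.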


### Question 6

* Select all the "Rolls-Royce" cars from the dataset.
* Select only columns "Engine HP", "Engine Cylinders", "highway MPG".
* Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 7 rows).
* Get the underlying NumPy array. Let's call it `X`.
* Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
* Invert `XTX`.
* What's the sum of all the elements of the result?

Hint: if the result is negative, re-read the task one more time
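The NumPy operations involved can be sketched on a small toy matrix (standing in for the 7x3 Rolls-Royce matrix):

```python
import numpy as np

# Toy 3x2 stand-in for the 7x3 matrix from the steps above
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 7.0],
])

XTX = X.T @ X                  # X.T is the transpose
XTX_inv = np.linalg.inv(XTX)   # matrix inverse
total = XTX_inv.sum()          # sum of all elements
print(total)
```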


### Question 7

* Create an array `y` with values `[1000, 1100, 900, 1200, 1000, 850, 1300]`.
* Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
* What's the value of the first element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.
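The computation described above is the normal equation of linear regression, `w = (XᵀX)⁻¹ Xᵀ y`. A minimal sketch with toy data:

```python
import numpy as np

X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 7.0],
])
y = np.array([1.0, 2.0, 3.0])

# w = (X^T X)^{-1} X^T y -- the normal equation
XTX = X.T @ X
w = np.linalg.inv(XTX) @ X.T @ y
print(w[0])  # first element of w
```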

## Submit the results

Submit your results here: https://forms.gle/aiunQqRtqcay8Wwo9.

If your answer doesn't match options exactly, select the closest one.


## Deadline

The deadline for submitting is 13 September 2021, 17:00 CET. After that, the form will be closed.


## Navigation

* [Machine Learning Zoomcamp course](../)
* [Lesson 1: Introduction to Machine Learning](./)
* Previous: [Summary](10-summary.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)
117 changes: 3 additions & 114 deletions course-zoomcamp/02-regression/homework.md
@@ -1,115 +1,4 @@
## 2.18 Homework
## Homework

Solution: [homework.ipynb](homework.ipynb)

### Dataset

In this homework, we will use the New York City Airbnb Open Data. You can take it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

The goal of this homework is to create a regression model for predicting apartment prices (column `'price'`).

### EDA

* Load the data.
* Look at the `price` variable. Does it have a long tail?

### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only them.

### Question 1

Find a feature with missing values. How many missing values does it have?


### Question 2

What's the median (50th percentile) for the variable `'minimum_nights'`?


### Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Make sure that the target value ('price') is not in your dataframe.
* Apply the log transformation to the price variable using the `np.log1p()` function.
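The steps above can be sketched with NumPy and Pandas; here a tiny toy frame stands in for the Airbnb data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Airbnb data
df = pd.DataFrame({
    'minimum_nights': np.arange(10),
    'price': np.arange(10) * 10 + 50,
})

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(42)      # the seed fixes the shuffle
np.random.shuffle(idx)

df_shuffled = df.iloc[idx]
df_train = df_shuffled.iloc[:n_train].copy()
df_val = df_shuffled.iloc[n_train:n_train + n_val].copy()
df_test = df_shuffled.iloc[n_train + n_val:].copy()

# log-transform the target and remove it from the feature frames
y_train = np.log1p(df_train.pop('price').values)
y_val = np.log1p(df_val.pop('price').values)
y_test = np.log1p(df_test.pop('price').values)
```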


### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?


### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0.
* Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.
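One way the lessons implement regularized linear regression is the ridge form of the normal equation, `w = (XᵀX + rI)⁻¹ Xᵀ y`. A sketch on noise-free toy data (the homework evaluates on the validation set instead):

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    # Add a bias column of ones, then solve the regularized normal equation
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T @ X + r * np.eye(X.shape[1])
    w_full = np.linalg.inv(XTX) @ X.T @ y
    return w_full[0], w_full[1:]   # bias term, feature weights

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

# Noise-free toy data: y = 3 + 2x, so r=0 should recover it exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3 + 2 * X[:, 0]

for r in [0, 0.000001, 0.0001, 0.001, 0.01]:
    w0, w = train_linear_regression_reg(X, y, r=r)
    score = rmse(y, w0 + X @ w)
    print(r, round(score, 2))
```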


### Question 5

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)


> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different.
> If standard deviation of scores is low, then our model is *stable*.
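The seed loop can be sketched as follows; this uses synthetic data and `np.linalg.lstsq` (bias column plus least squares is equivalent to unregularized linear regression):

```python
import numpy as np

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

# Synthetic stand-in for the prepared Airbnb features
np.random.seed(0)
X_all = np.random.rand(100, 2)
y_all = 2 * X_all[:, 0] - X_all[:, 1] + 0.1 * np.random.randn(100)

scores = []
for seed in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    idx = np.arange(len(X_all))
    np.random.seed(seed)
    np.random.shuffle(idx)

    n_train = int(0.6 * len(idx))
    n_val = int(0.2 * len(idx))
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]

    # bias column + least squares = unregularized linear regression
    Xt = np.column_stack([np.ones(n_train), X_all[train_idx]])
    w, *_ = np.linalg.lstsq(Xt, y_all[train_idx], rcond=None)

    Xv = np.column_stack([np.ones(n_val), X_all[val_idx]])
    scores.append(rmse(y_all[val_idx], Xv @ w))

print(round(np.std(scores), 3))
```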

### Question 6

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`.
* What's the RMSE on the test dataset?


## Submit the results

Submit your results here: https://forms.gle/2N9GkTr1AgNeZ8hD7.

If your answer doesn't match options exactly, select the closest one.

## Deadline


The deadline for submitting is 20 September 2021, 17:00 CET. After that, the form will be closed.

## Navigation

* [Machine Learning Zoomcamp course](../)
* [Session 2: Machine Learning for Regression](./)
* Previous: [Explore more](17-explore-more.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)
122 changes: 3 additions & 119 deletions course-zoomcamp/03-classification/homework.md
@@ -1,120 +1,4 @@
## 3.15 Homework
## Homework

### Dataset

In this homework, we will continue working with the New York City Airbnb Open Data. You can take it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

We'll keep working with the `'price'` variable, and we'll transform it to a classification task.


### Features

For the rest of the homework, you'll need to use the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. So the whole feature set is as follows:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only them and fill in the missing values with 0.


### Question 1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?


### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value ('price') is not in your dataframe.


### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
* In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

Example of a correlation matrix for the car price dataset:

<img src="images/correlation-matrix.png" />
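A sketch of computing the matrix with `df.corr()` and locating the strongest off-diagonal pair; the toy frame below is constructed so that `'reviews_per_month'` tracks `'number_of_reviews'`:

```python
import numpy as np
import pandas as pd

# Toy numeric frame; 'reviews_per_month' is a scaled copy of
# 'number_of_reviews', so that pair has the strongest correlation
df_train = pd.DataFrame({
    'latitude': [40.70, 40.80, 40.60, 40.75],
    'longitude': [-74.00, -73.90, -74.05, -73.95],
    'number_of_reviews': [10, 3, 25, 7],
    'reviews_per_month': [1.0, 0.3, 2.5, 0.7],
})

corr = df_train.corr()

# Mask the diagonal (every feature correlates perfectly with itself),
# then find the most correlated pair by absolute value
abs_corr = corr.abs().values
np.fill_diagonal(abs_corr, 0)
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
print(corr.index[i], corr.columns[j])
```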


### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`, and `0` otherwise.
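The binarization is a one-liner in Pandas:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 152, 200, 90]})
df['above_average'] = (df['price'] >= 152).astype(int)
print(df['above_average'].tolist())  # [0, 1, 1, 0]
```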


### Question 3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has a bigger score?
* Round it to 2 decimal digits using `round(score, 2)`.
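Scikit-Learn's `mutual_info_score` takes the two label arrays directly. A sketch on toy data where the categorical variable fully determines the target:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy training set; 'room_type' here fully determines 'above_average'
df_train = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt',
                  'Entire home/apt', 'Private room'],
    'above_average': [0, 1, 1, 0],
})

score = mutual_info_score(df_train['room_type'], df_train['above_average'])
print(round(score, 2))  # 0.69 -- ln(2), since the dependence is perfect
```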


### Question 4

* Now let's train a logistic regression.
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
* To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
* `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
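One option for the one-hot encoding, as used in the lessons, is `DictVectorizer`. A sketch on toy data (in the homework you'd transform the validation set with the same fitted vectorizer and compute accuracy there):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data with one categorical and one numerical feature
df_train = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt',
                  'Entire home/apt', 'Private room'],
    'minimum_nights': [1, 3, 2, 1],
})
y_train = [0, 1, 1, 0]

# DictVectorizer one-hot encodes string columns and
# passes numerical columns through unchanged
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)

accuracy = (model.predict(X_train) == y_train).mean()
print(round(accuracy, 2))
```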


### Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
* `neighbourhood_group`
* `room_type`
* `number_of_reviews`
* `reviews_per_month`

> **Note**: the difference doesn't have to be positive.

### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.
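The alpha loop with `sklearn.linear_model.Ridge` can be sketched as follows; the train/validation matrices here are synthetic stand-ins for the prepared data:

```python
import numpy as np
from sklearn.linear_model import Ridge

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

# Synthetic stand-in for the prepared train/validation matrices
np.random.seed(42)
X_train = np.random.rand(50, 3)
y_train = np.log1p(100 + 50 * X_train[:, 0] + 10 * np.random.rand(50))
X_val = np.random.rand(20, 3)
y_val = np.log1p(100 + 50 * X_val[:, 0] + 10 * np.random.rand(20))

scores = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    scores[alpha] = round(rmse(y_val, model.predict(X_val)), 3)

print(scores)
```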


## Submit the results

Submit your results here: https://forms.gle/xGpZhoq9Efm9E4RA9

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline

The deadline for submitting is 27 September 2021, 17:00 CET. After that, the form will be closed.


## Navigation

* [Machine Learning Zoomcamp course](../)
* [Session 3: Machine Learning for Classification](./)
* Previous: [Explore more](14-explore-more.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)