Commit

preparing for 2022 cohort
alexeygrigorev committed Aug 29, 2022
1 parent f2f9c41 commit 9cf72a8
Showing 55 changed files with 1,941 additions and 1,839 deletions.
106 changes: 3 additions & 103 deletions course-zoomcamp/01-intro/homework.md
@@ -1,104 +1,4 @@
## Session #1 Homework
## Homework

> **Solution**: [homework-1.ipynb](homework-1.ipynb).
### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from [06-environment.md](06-environment.md).

### Question 1

What's the version of NumPy that you installed?

You can get the version information using the `__version__` field:

```python
np.__version__
```

### Question 2

What's the version of Pandas?
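As with NumPy, Pandas exposes its version string through the `__version__` field:

```python
import pandas as pd

# Pandas reports its version the same way NumPy does
print(pd.__version__)
```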


### Getting the data

For this homework, we'll use the same dataset as for the next session - the car price dataset.

Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.


### Question 3

What's the average price of BMW cars in the dataset?
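One way to approach this is to filter the rows by make and take the mean of the price column. The sketch below uses a tiny toy frame standing in for the real data, and assumes the dataset's column names `'Make'` and `'MSRP'`:

```python
import pandas as pd

# Toy stand-in for the car price dataset; in the real data the make
# is in the 'Make' column and the price is in the 'MSRP' column
df = pd.DataFrame({
    'Make': ['BMW', 'BMW', 'Audi'],
    'MSRP': [40000, 50000, 30000],
})

mean_bmw = df[df['Make'] == 'BMW']['MSRP'].mean()
print(mean_bmw)  # 45000.0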


### Question 4

Select a subset of cars after year 2015 (inclusive, i.e. 2015 and after). How many of them have missing values for Engine HP?
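A sketch of the filter-then-count pattern, on toy data and assuming the dataset's `'Year'` and `'Engine HP'` column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in; the real dataset has 'Year' and 'Engine HP' columns
df = pd.DataFrame({
    'Year': [2014, 2015, 2016, 2017],
    'Engine HP': [200.0, np.nan, 300.0, np.nan],
})

after_2015 = df[df['Year'] >= 2015]
missing_hp = after_2015['Engine HP'].isnull().sum()
print(missing_hp)  # 2
```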


### Question 5

* Calculate the average "Engine HP" in the dataset.
* Use the `fillna` method to fill the missing values in "Engine HP" with the mean value from the previous step.
* Now, calculate the average of "Engine HP" again.
* Has it changed?

Round both means before answering this question. You can use the `round` function for that:

```python
print(round(mean_hp_before))
print(round(mean_hp_after))
```
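On a toy frame, the full sequence of steps looks like this (note `Series.mean()` skips NaN values by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Engine HP': [100.0, np.nan, 300.0]})

mean_hp_before = df['Engine HP'].mean()  # NaN values are skipped
df['Engine HP'] = df['Engine HP'].fillna(mean_hp_before)
mean_hp_after = df['Engine HP'].mean()

print(round(mean_hp_before))  # 200
print(round(mean_hp_after))   # 200
```

On this toy data the mean is unchanged by construction; run the same steps on the real column to answer the question.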


### Question 6

* Select all the "Rolls-Royce" cars from the dataset.
* Select only columns "Engine HP", "Engine Cylinders", "highway MPG".
* Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 7 rows).
* Get the underlying NumPy array. Let's call it `X`.
* Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
* Invert `XTX`.
* What's the sum of all the elements of the result?

Hint: if the result is negative, re-read the task one more time
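The NumPy operations involved can be sketched on a small toy matrix (standing in for the 7x3 Rolls-Royce matrix):

```python
import numpy as np

# Toy 3x2 stand-in for the 7x3 matrix from the steps above
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 7.0],
])

XTX = X.T @ X                  # X.T is the transpose
XTX_inv = np.linalg.inv(XTX)   # matrix inverse
total = XTX_inv.sum()          # sum of all elements
print(total)
```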


### Question 7

* Create an array `y` with values `[1000, 1100, 900, 1200, 1000, 850, 1300]`.
* Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
* What's the value of the first element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.
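The computation described above is the normal equation of linear regression, `w = (XᵀX)⁻¹ Xᵀ y`. A minimal sketch with toy data:

```python
import numpy as np

X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 7.0],
])
y = np.array([1.0, 2.0, 3.0])

# w = (X^T X)^{-1} X^T y -- the normal equation
XTX = X.T @ X
w = np.linalg.inv(XTX) @ X.T @ y
print(w[0])  # first element of w
```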

## Submit the results

Submit your results here: https://forms.gle/aiunQqRtqcay8Wwo9.

If your answer doesn't match options exactly, select the closest one.


## Deadline

The deadline for submitting is 13 September 2021, 17:00 CET. After that, the form will be closed.


## Navigation

* [Machine Learning Zoomcamp course](../)
* [Lesson 1: Introduction to Machine Learning](./)
* Previous: [Summary](10-summary.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)
117 changes: 3 additions & 114 deletions course-zoomcamp/02-regression/homework.md
@@ -1,115 +1,4 @@
## 2.18 Homework
## Homework

Solution: [homework.ipynb](homework.ipynb)

### Dataset

In this homework, we will use the New York City Airbnb Open Data. You can take it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

The goal of this homework is to create a regression model for predicting apartment prices (column `'price'`).

### EDA

* Load the data.
* Look at the `price` variable. Does it have a long tail?

### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only them.

### Question 1

Find a feature with missing values. How many missing values does it have?


### Question 2

What's the median (50th percentile) for the variable `'minimum_nights'`?


### Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Make sure that the target value ('price') is not in your dataframe.
* Apply the log transformation to the price variable using the `np.log1p()` function.
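The steps above can be sketched with NumPy and Pandas; here a tiny toy frame stands in for the Airbnb data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Airbnb data
df = pd.DataFrame({
    'minimum_nights': np.arange(10),
    'price': np.arange(10) * 10 + 50,
})

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(42)      # the seed fixes the shuffle
np.random.shuffle(idx)

df_shuffled = df.iloc[idx]
df_train = df_shuffled.iloc[:n_train].copy()
df_val = df_shuffled.iloc[n_train:n_train + n_val].copy()
df_test = df_shuffled.iloc[n_train + n_val:].copy()

# log-transform the target and remove it from the feature frames
y_train = np.log1p(df_train.pop('price').values)
y_val = np.log1p(df_val.pop('price').values)
y_test = np.log1p(df_test.pop('price').values)
```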


### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?


### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0.
* Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.
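One way the lessons implement regularized linear regression is the ridge form of the normal equation, `w = (XᵀX + rI)⁻¹ Xᵀ y`. A sketch on noise-free toy data (the homework evaluates on the validation set instead):

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    # Add a bias column of ones, then solve the regularized normal equation
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T @ X + r * np.eye(X.shape[1])
    w_full = np.linalg.inv(XTX) @ X.T @ y
    return w_full[0], w_full[1:]   # bias term, feature weights

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

# Noise-free toy data: y = 3 + 2x, so r=0 should recover it exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3 + 2 * X[:, 0]

for r in [0, 0.000001, 0.0001, 0.001, 0.01]:
    w0, w = train_linear_regression_reg(X, y, r=r)
    score = rmse(y, w0 + X @ w)
    print(r, round(score, 2))
```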


### Question 5

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)


> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different.
> If standard deviation of scores is low, then our model is *stable*.
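The seed loop can be sketched as follows; this uses synthetic data and `np.linalg.lstsq` (bias column plus least squares is equivalent to unregularized linear regression):

```python
import numpy as np

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

# Synthetic stand-in for the prepared Airbnb features
np.random.seed(0)
X_all = np.random.rand(100, 2)
y_all = 2 * X_all[:, 0] - X_all[:, 1] + 0.1 * np.random.randn(100)

scores = []
for seed in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    idx = np.arange(len(X_all))
    np.random.seed(seed)
    np.random.shuffle(idx)

    n_train = int(0.6 * len(idx))
    n_val = int(0.2 * len(idx))
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]

    # bias column + least squares = unregularized linear regression
    Xt = np.column_stack([np.ones(n_train), X_all[train_idx]])
    w, *_ = np.linalg.lstsq(Xt, y_all[train_idx], rcond=None)

    Xv = np.column_stack([np.ones(n_val), X_all[val_idx]])
    scores.append(rmse(y_all[val_idx], Xv @ w))

print(round(np.std(scores), 3))
```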

### Question 6

* Split the dataset like previously, use seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`.
* What's the RMSE on the test dataset?


## Submit the results

Submit your results here: https://forms.gle/2N9GkTr1AgNeZ8hD7.

If your answer doesn't match options exactly, select the closest one.

## Deadline


The deadline for submitting is 20 September 2021, 17:00 CET. After that, the form will be closed.

## Navigation

* [Machine Learning Zoomcamp course](../)
* [Session 2: Machine Learning for Regression](./)
* Previous: [Explore more](17-explore-more.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)
122 changes: 3 additions & 119 deletions course-zoomcamp/03-classification/homework.md
@@ -1,120 +1,4 @@
## 3.15 Homework
## Homework

### Dataset

In this homework, we will continue working with the New York City Airbnb Open Data. You can take it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

We'll keep working with the `'price'` variable, and we'll transform it to a classification task.


### Features

For the rest of the homework, you'll need to use the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. So the whole feature set is as follows:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only them and fill in the missing values with 0.


### Question 1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?


### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value ('price') is not in your dataframe.


### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
* In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

Example of a correlation matrix for the car price dataset:

<img src="images/correlation-matrix.png" />
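A sketch of computing the matrix with `df.corr()` and locating the strongest off-diagonal pair; the toy frame below is constructed so that `'reviews_per_month'` tracks `'number_of_reviews'`:

```python
import numpy as np
import pandas as pd

# Toy numeric frame; 'reviews_per_month' is a scaled copy of
# 'number_of_reviews', so that pair has the strongest correlation
df_train = pd.DataFrame({
    'latitude': [40.70, 40.80, 40.60, 40.75],
    'longitude': [-74.00, -73.90, -74.05, -73.95],
    'number_of_reviews': [10, 3, 25, 7],
    'reviews_per_month': [1.0, 0.3, 2.5, 0.7],
})

corr = df_train.corr()

# Mask the diagonal (every feature correlates perfectly with itself),
# then find the most correlated pair by absolute value
abs_corr = corr.abs().values
np.fill_diagonal(abs_corr, 0)
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
print(corr.index[i], corr.columns[j])
```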


### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`, and `0` otherwise.
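The binarization is a one-liner in Pandas:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 152, 200, 90]})
df['above_average'] = (df['price'] >= 152).astype(int)
print(df['above_average'].tolist())  # [0, 1, 1, 0]
```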


### Question 3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has a bigger score?
* Round it to 2 decimal digits using `round(score, 2)`.
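Scikit-Learn's `mutual_info_score` takes the two label arrays directly. A sketch on toy data where the categorical variable fully determines the target:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy training set; 'room_type' here fully determines 'above_average'
df_train = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt',
                  'Entire home/apt', 'Private room'],
    'above_average': [0, 1, 1, 0],
})

score = mutual_info_score(df_train['room_type'], df_train['above_average'])
print(round(score, 2))  # 0.69 -- ln(2), since the dependence is perfect
```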


### Question 4

* Now let's train a logistic regression.
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
* To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
* `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
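One option for the one-hot encoding, as used in the lessons, is `DictVectorizer`. A sketch on toy data (in the homework you'd transform the validation set with the same fitted vectorizer and compute accuracy there):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data with one categorical and one numerical feature
df_train = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt',
                  'Entire home/apt', 'Private room'],
    'minimum_nights': [1, 3, 2, 1],
})
y_train = [0, 1, 1, 0]

# DictVectorizer one-hot encodes string columns and
# passes numerical columns through unchanged
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)

accuracy = (model.predict(X_train) == y_train).mean()
print(round(accuracy, 2))
```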


### Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
* `neighbourhood_group`
* `room_type`
* `number_of_reviews`
* `reviews_per_month`

> **Note**: the difference doesn't have to be positive.

### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.
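The alpha loop with `sklearn.linear_model.Ridge` can be sketched as follows; the train/validation matrices here are synthetic stand-ins for the prepared data:

```python
import numpy as np
from sklearn.linear_model import Ridge

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

# Synthetic stand-in for the prepared train/validation matrices
np.random.seed(42)
X_train = np.random.rand(50, 3)
y_train = np.log1p(100 + 50 * X_train[:, 0] + 10 * np.random.rand(50))
X_val = np.random.rand(20, 3)
y_val = np.log1p(100 + 50 * X_val[:, 0] + 10 * np.random.rand(20))

scores = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    scores[alpha] = round(rmse(y_val, model.predict(X_val)), 3)

print(scores)
```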


## Submit the results

Submit your results here: https://forms.gle/xGpZhoq9Efm9E4RA9

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline

The deadline for submitting is 27 September 2021, 17:00 CET. After that, the form will be closed.


## Navigation

* [Machine Learning Zoomcamp course](../)
* [Session 3: Machine Learning for Classification](./)
* Previous: [Explore more](14-explore-more.md)
* For 2022 cohort homework, check [the 2022 cohort folder](../cohorts/2022/)
* For 2021 cohort homework and solution, check [the 2021 cohort folder](../cohorts/2021/)