Open Asteroid_updated_notebook.ipynb for the updated project, or open it in Colab for a better view.
- 👩🍳 Data preprocessing pipeline
- 💡 "How do Albedo and H relate to diameter?"
- Understanding other attributes associated with predicting diameter
- 💡 "How do Mean motion and Orbital Period relate to diameter?"
- ☠️ The threat posed by Near Earth Asteroids🌏
- 🔭Observing the asteroids
- Rankings🥇🥈🥉
- Studying the composition⚛️
- Distributions and eccentricities💫
- Feature Engineering
- Training on estimated diameter and spectral classes
- Isolation forests🌲🌳
- Stratified Sampling from scratch
- 📊Interactive Seaborn Widget
- Hyperparameter tuning using WandB sweeps🧹
- Parallel coordinate plot for the XGBoost model
- Model Interpretation with Lime🍋
- Preprocessed Data
- Interactive Seaborn Widget 📊 (Download Notebook to access)
- ML models
- Comparison of models
- Linear Regression
- Lasso
- Ridge
- Support Vector Regressor
- KNN
- Decision Tree
- Random Forest (With Hyperparameter Optimization)
- XGBoost
- extent, GM, IR, BV, UB, G -> removed because the majority of values in these columns were missing.
- spec_B, spec_T -> contained 34 classes. I would have liked to encode at least the most frequent ones, but chose not to tamper with the data, as it would have added multiple extra columns.
- neo, pha, class, condition_code -> these factors are not related to diameter.
Next, I extracted all the rows in which the values for diameter and rot_per were not missing. Some values for diameter were non-numeric and hence were cleaned. Of the columns left, very few had any missing values, so I filled those gaps with the column medians. The values were rounded to 5 decimal places, and I then converted the whole dataset to float to catch any remaining non-numeric values.
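A minimal sketch of these cleaning steps, assuming the raw dump is read from a CSV named Asteroid.csv (the file name and the exact order of operations are assumptions, not necessarily what the notebook does):

```python
import pandas as pd

df = pd.read_csv("Asteroid.csv", low_memory=False)

# Drop the mostly-empty and unrelated columns.
drop_cols = ["extent", "GM", "IR", "BV", "UB", "G",
             "spec_B", "spec_T", "neo", "pha", "class", "condition_code"]
df = df.drop(columns=drop_cols, errors="ignore")

# Keep only rows where diameter and rot_per are present.
df = df.dropna(subset=["diameter", "rot_per"])

# Coerce everything to numeric (non-numeric diameter entries become NaN and
# are dropped), fill remaining gaps with column medians, round, cast to float.
df = df.apply(pd.to_numeric, errors="coerce").dropna(subset=["diameter"])
df = df.fillna(df.median()).round(5).astype(float)
```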
I fitted some models to evaluate my work so far. The results were terrible, as expected. Later, while tinkering with some seaborn plots, I realized that I had completely forgotten to remove the outliers. What I did next helped a great deal.
I researched outlier detection and found some bad methods on the internet, but then stumbled upon something called a pairplot. After further research and playing with code, I learned that you can build widgets you can interact with inside the Jupyter notebook itself. A few rounds of coding and updating got me to what you can access right now. Detecting and removing outliers has never been easier: we can plot variables against each other, spot the outliers, and remove them with a single line of code (e.g. `df = df[df.a < 20]`).
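A minimal sketch of that kind of widget, using a toy dataframe in place of the cleaned asteroid data (the column names and values here are purely illustrative):

```python
import pandas as pd
import seaborn as sns
from ipywidgets import interact

# Toy frame standing in for the cleaned asteroid data.
df = pd.DataFrame({"albedo": [0.10, 0.22, 0.15, 0.90],
                   "H": [12.0, 14.5, 13.1, 3.0],
                   "diameter": [5.2, 2.1, 3.8, 250.0]})

# Dropdowns let you pick any two columns and scatter-plot them to spot outliers.
@interact(x=list(df.columns), y=list(df.columns))
def scatter(x, y):
    sns.scatterplot(data=df, x=x, y=y)

# Once an outlier band is visible, removing it is a single line, e.g.:
# df = df[df["diameter"] < 200]
```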
Removing the outliers gave me a significant boost in model performance. My previously terrible results were now less terrible: with linear regression I had been getting a mean squared error of about 300, which dropped to 217. Stratified sampling later brought it down to 175.
Learn about Stratified Sampling
An extra column called diameter_grp was created. This is a class column containing values from 1 to 5, indicating how large the asteroid's diameter is. The process is called stratified sampling and is done so that each group (1-5) keeps the same proportion in both the original dataframe and the split dataframe. The graph shows the number of values in each group:
Next, I performed stratified sampling using StratifiedShuffleSplit with respect to 'diameter_grp'. Let's check the proportions of each group in the original and split datasets. They look almost equal, which has a positive effect on our models.
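A minimal sketch of this step, assuming a cleaned frame with a numeric diameter column; the bin edges used to build diameter_grp below are illustrative, not necessarily the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Stand-in data; in the notebook this is the cleaned asteroid frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"diameter": rng.lognormal(1.5, 1.0, 2000)})

# Bucket the diameter into 5 classes (bin edges here are illustrative).
df["diameter_grp"] = pd.cut(df["diameter"],
                            bins=[0, 2, 5, 10, 20, np.inf],
                            labels=[1, 2, 3, 4, 5])

# Split so that each diameter_grp keeps (almost) the same share in both parts.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(df, df["diameter_grp"]):
    train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# Compare the class proportions of the original frame and the training split.
print(df["diameter_grp"].value_counts(normalize=True).sort_index())
print(train_set["diameter_grp"].value_counts(normalize=True).sort_index())
```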
(Plots: diameter_grp proportions in the original set vs. the training set.)
Previous Best:
Best Model: Random Forest.
MSE: 14.0625
RMSE: 3.7500
MAE: 1.2805
Now Best (After sampling):
Best Model: XGBoost.
MSE: 9.8679
RMSE: 3.1413
MAE: 1.2129
Later I found a technique called pruning for decision trees, which reduces overfitting. I tried cost complexity pruning on my decision tree, but it once again produced nothing significant. Nevertheless, it's a great way to reduce overfitting and might work wonders on another dataset.
Know more:
https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
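For reference, here is a minimal sketch of cost complexity pruning in scikit-learn, on synthetic data rather than the asteroid set: the pruning path gives candidate ccp_alpha values, and each one is evaluated on a held-out split.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, not the asteroid set.
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path yields the effective alphas; refit one tree per alpha and
# keep the value that scores best on the held-out split (larger alpha = more pruning).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
scores = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from floating point
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    scores.append((alpha, tree.score(X_test, y_test)))

best_alpha, best_r2 = max(scores, key=lambda t: t[1])
print(f"best ccp_alpha={best_alpha:.4f}, held-out R^2={best_r2:.3f}")
```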
XGBoost outperformed Random Forest and currently gives the lowest error. None of these tree-based algorithms require the data to be scaled.
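As a rough illustration of how the comparison works, here is a minimal sketch that fits both ensembles on synthetic data and reports the same MSE/RMSE/MAE metrics; the data and model settings are placeholders, not the ones from the notebook.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the asteroid features.
X, y = make_regression(n_samples=1000, n_features=8, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree ensembles are compared on unscaled features, using MSE / RMSE / MAE.
models = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "XGBoost": xgb.XGBRegressor(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    mae = mean_absolute_error(y_test, pred)
    print(f"{name}: MSE={mse:.2f} RMSE={np.sqrt(mse):.2f} MAE={mae:.2f}")
```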
The final-day results were as follows:
- Table:
Model | MSE | RMSE | MAE |
---|---|---|---|
Linear Regression | 175.66 | 13.25 | 8.43 |
Lasso | 194.45 | 13.94 | 8.79 |
Ridge | 175.25 | 13.23 | 8.41 |
SVR | 209.44 | 14.47 | 6.34 |
KNN | 190.45 | 13.80 | 7.48 |
Decision Tree | 30.65 | 5.53 | 2.99 |
Random Forest | 10.40 | 3.22 | 1.17 |
Random Forest (Tuned) | 17.28 | 4.15 | 2.29 |
XGBoost | 9.86 | 3.14 | 1.21 |
- Chart: Day 3 vs. Final Day comparison
Best Model: XGBoost.
MSE: 9.8679
RMSE: 3.1413
MAE: 1.2129
- Seaborn
- xgboost
- matplotlib
- sklearn
- ipywidgets
- numpy
- pandas