harshjadhav890/Asteroid-Diameter


Open Asteroid_updated_notebook.ipynb for the updated project. Open it in Colab for a better view.

Table of contents of the updated project:

⚙️ Data processing

  • 👩‍🍳 Data preprocessing pipeline

📊🎨 Data Analysis

  • 💡 "How do Albedo and H relate to diameter?"
  • Understanding other attributes associated with predicting diameter
  • 💡 "How do Mean motion and Orbital Period relate to diameter?"
  • ☠️ The threat posed by Near Earth Asteroids🌏
  • 🔭Observing the asteroids
  • Rankings🥇🥈🥉
  • Studying the composition⚛️
  • Distributions and eccentricities💫

👾Playing around

  • Feature Engineering
  • Training on estimated diameter and spectral classes
  • Isolation forests🌲🌳
  • Stratified Sampling from scratch
  • 📊Interactive Seaborn Widget

🤖Implementing ML models

  • Hyperparameter tuning using WandB sweeps🧹
  • Parallel coordinate plot for the XGBoost model
  • Model Interpretation with Lime🍋

☄️Comparing the Models



👇This README covers the older version of the project: Asteroid.ipynb

☄️Measure Me

ML models for measuring asteroids!

Notebook includes:

  • Preprocessed Data
  • Interactive Seaborn Widget 📊 (Download Notebook to access)
  • ML models
  • Comparison of models

Models Included:

  • Linear Regression
  • Lasso
  • Ridge
  • Support Vector Regressor
  • KNN
  • Decision Tree
  • Random Forest (With Hyperparameter Optimization)
  • XGBoost

The Process:

Data Processing

The dataset given to us was massive, unlike anything I had worked with before. It was big not just in the number of rows but also in the number of variables we had to train the model on. Clearly, some serious work had to be done, so I started by removing unnecessary columns:

  • extent, GM, IR, BV, UB, G -> removed because a majority of their values were missing.
  • spec_B, spec_T -> contained 34 classes. I would have liked to encode at least the most frequent ones, but that would have added many columns, so I chose not to tamper with the data.
  • neo, pha, class, condition_code -> not related to diameter.

Next, I kept only the rows in which the values for diameter and rot_per were present. Some values for diameter were non-numeric and had to be cleaned. Of the columns left, very few had any missing values, so I filled the gaps with column medians. The values were rounded to 5 decimal places, as the originals were too long to store as floats. Finally, I converted the whole dataset to float to catch any leftover non-numeric values.
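A minimal sketch of this cleaning pipeline in pandas (the file name is a placeholder; column names follow the dataset):

```python
import pandas as pd

# Hypothetical file name; point this at wherever the dataset lives.
df = pd.read_csv("asteroids.csv")

# Drop mostly-missing or unrelated columns.
df = df.drop(columns=["extent", "GM", "IR", "BV", "UB", "G",
                      "spec_B", "spec_T", "neo", "pha",
                      "class", "condition_code"])

# Clean non-numeric diameters, then keep rows where diameter
# and rot_per are present.
df["diameter"] = pd.to_numeric(df["diameter"], errors="coerce")
df = df.dropna(subset=["diameter", "rot_per"])

# Fill the few remaining gaps with column medians, round, and cast to float.
df = df.fillna(df.median(numeric_only=True))
df = df.round(5).astype(float)
```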

I fitted some models to evaluate my work so far. The results were terrible, as expected. Later on, while tinkering with some seaborn plots, I realized I had completely forgotten to remove the outliers. What I did next helped a great deal...

Data Analysis

I found plenty of bad methods for detecting outliers on the internet, but then I stumbled upon something called a pairplot. With further research and some playing with code, I learned that you can build widgets you can interact with inside the Jupyter notebook itself. A few rounds of coding and updating the code got me to what you can access right now. Detecting and removing outliers had never been easier!

(Plots: with outliers vs. without outliers)

  • We can easily plot variables against each other, spot outliers, and remove them with a single line of code (e.g., df = df[df.a < 20])

Seaborn Widget (Download Notebook to access)
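A minimal sketch of an interactive scatter widget of this kind, built with ipywidgets' interact (the exact widget in the notebook may differ):

```python
import seaborn as sns
from ipywidgets import interact

# df continues from the cleaning sketch above.
numeric_cols = df.select_dtypes("number").columns.tolist()

@interact(x=numeric_cols, y=numeric_cols)
def scatter(x, y):
    # Redraws whenever a new pair of columns is picked from the dropdowns.
    sns.scatterplot(data=df, x=x, y=y)
```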



Removing outliers gave a significant boost in model performance. My previously terrible results were now... less terrible. For context, linear regression had been giving a mean squared error of about 300; removing outliers brought it down to 217, and stratified sampling brought it down further to 175.

Learn about Stratified Sampling

Stratifying and Splitting the data

An extra column called diameter_grp was created: a class column containing values from 1 to 5 that signify how big the asteroid's diameter is. The technique is called stratified sampling, and it is done so that each group (1-5) keeps the same proportion in both the original dataframe and the split dataframe. The graph shows the number of values in each group:



Next, I performed stratified sampling with StratifiedShuffleSplit, stratifying on 'diameter_grp'. Let's check the proportions of each group in the original and split datasets: they look almost equal, which has a positive effect on our models.
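A sketch of this step using pd.cut and StratifiedShuffleSplit (the bin edges for diameter_grp are assumptions; the notebook defines its own):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin edges are illustrative; the notebook picks its own groups 1-5.
df["diameter_grp"] = pd.cut(df["diameter"],
                            bins=[0., 5., 15., 30., 60., np.inf],
                            labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(df, df["diameter_grp"]):
    train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# Group proportions should now match between the original and the split.
print(df["diameter_grp"].value_counts(normalize=True))
print(train_set["diameter_grp"].value_counts(normalize=True))
```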

(Tables: original proportions vs. training proportions)



Previous Best:
Best Model: Random Forest.
MSE: 14.0625
RMSE: 3.7500
MAE: 1.2805

Now Best (After sampling):
Best Model: XGBoost.
MSE: 9.8679
RMSE: 3.1413
MAE: 1.2129

Trying out Models

Linear, Lasso, Ridge

Eight models have been implemented in the notebook, the first being Linear Regression. It gives a bad result because the dataset is too complex and almost none of the variables share a linear relationship with diameter. The same goes for Lasso and Ridge, which work on the same principle. I had also heard that scaling the data gives better results with these algorithms; that turned out to be false here, since scaling caused all three models to fail miserably.
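A minimal sketch of fitting the three linear models and comparing their test MSE, reusing the stratified split from above:

```python
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

# train_set / test_set come from the stratified-split sketch above.
X_train = train_set.drop(columns=["diameter", "diameter_grp"])
y_train = train_set["diameter"]
X_test = test_set.drop(columns=["diameter", "diameter_grp"])
y_test = test_set["diameter"]

for model in (LinearRegression(), Lasso(), Ridge()):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: MSE = {mse:.2f}")
```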

SVR, KNN

SVR performed the worst of all the models. It also ran more slowly than every model I had implemented until then. KNN did a good job overall: it ran quickly and produced an MSE of 164.1.

Decision Tree, RandomForest, XGBoost

Decision Tree gives great results; setting the tree depth to 4 gives the lowest error and the lowest runtime. Random Forest beat Decision Tree by quite a bit and was the reigning champion for a long time. I ran hyperparameter optimization on Random Forest, but it produced no significant improvement.
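The README doesn't record the exact search space used; a hedged sketch of randomized hyperparameter search over a Random Forest might look like this (the grid is purely illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; the notebook's actual grid may differ.
param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8, 16, None],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_dist, n_iter=10, cv=3,
    scoring="neg_mean_squared_error", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
```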

Decision Tree

H shares a strong relationship with diameter, which is clearly reflected in the tree.



Later, I learned about pruning decision trees to reduce overfitting. I tried cost-complexity pruning on my decision tree, but it once again produced nothing significant. Nevertheless, it's a great way to reduce overfitting and might work wonders on another dataset; a sketch of the approach follows the link below.

Know more: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
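A minimal sketch of cost-complexity pruning with scikit-learn's DecisionTreeRegressor, reusing the split from earlier (the alpha sweep is illustrative):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Candidate alphas come from the training data itself.
path = DecisionTreeRegressor(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)

# Refit a pruned tree per alpha and keep the one with the lowest test MSE.
best_alpha, best_mse = 0.0, float("inf")
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    mse = mean_squared_error(y_test, tree.predict(X_test))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse
print(f"best ccp_alpha = {best_alpha:.4f}, MSE = {best_mse:.2f}")
```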
XGBoost outperformed Random Forest and currently gives the lowest error. None of these tree-based algorithms require the data to be scaled.
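For reference, a minimal XGBoost fit along the same lines (the parameters here are assumptions, not the notebook's exact settings):

```python
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

xgb = XGBRegressor(n_estimators=200, random_state=42)
xgb.fit(X_train, y_train)
print(mean_squared_error(y_test, xgb.predict(X_test)))
```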


Final Results

The final-day results were as follows:
| Model | MSE | RMSE | MAE |
|---|---|---|---|
| Linear Regression | 175.66 | 13.25 | 8.43 |
| Lasso | 194.45 | 13.94 | 8.79 |
| Ridge | 175.25 | 13.23 | 8.41 |
| SVR | 209.44 | 14.47 | 6.34 |
| KNN | 190.45 | 13.80 | 7.48 |
| Decision Tree | 30.65 | 5.53 | 2.99 |
| Random Forest | 10.40 | 3.22 | 1.17 |
| Random Forest (Tuned) | 17.28 | 4.15 | 2.29 |
| XGBoost | 9.86 | 3.14 | 1.21 |
(Charts: Day 3 vs. final-day results)

Best Model: XGBoost.
MSE: 9.8679
RMSE: 3.1413
MAE: 1.2129

Libraries used:

  • seaborn
  • xgboost
  • matplotlib
  • sklearn
  • ipywidgets
  • numpy
  • pandas



☄🪐🌖🌠☄👽🚀🌟
