Open Asteroid_updated_notebook.ipynb for the updated project, or open it in Colab for a better view.
- 👩🍳 Data preprocessing pipeline
- 💡 "How do Albedo and H relate to diameter?"
- Understanding other attributes associated with predicting diameter
- 💡 "How do Mean motion and Orbital Period relate to diameter?"
- ☠️ The threat posed by Near Earth Asteroids🌏
- 🔭Observing the asteroids
- Rankings🥇🥈🥉
- Studying the composition⚛️
- Distributions and eccentricities💫
- Feature Engineering
- Training on estimated diameter and spectral classes
- Isolation forests🌲🌳
- Stratified Sampling from scratch
- 📊Interactive Seaborn Widget
- Hyperparameter tuning using WandB sweeps🧹
- Parallel coordinate plot for the XGBoost model
- Model Interpretation with Lime🍋
- Preprocessed Data
- Interactive Seaborn Widget 📊 (Download Notebook to access)
- ML models
- Comparison of models
- Linear Regression
- Lasso
- Ridge
- Support Vector Regressor
- KNN
- Decision Tree
- Random Forest (With Hyperparameter Optimization)
- XGBoost
- extent, GM, IR, BV, UB, G -> removed because the majority of values in these columns were missing.
- spec_B, spec_T -> contained 34 classes. I would have liked to encode at least the most frequent ones, but chose not to tamper with the data, as it would have added multiple extra columns.
- neo, pha, class, condition_code -> these factors are not related to diameter.
Next, I extracted all the rows in which the values for diameter and rot_per were not missing. Some values for diameter were non-numeric and hence were cleaned. Of the columns left, very few had any missing values, so I filled those gaps with the column medians. The values were rounded to 5 decimal places, and I then converted the whole dataset to float to catch any remaining non-numeric values.
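A minimal sketch of these cleaning steps, assuming the raw dump is read from a CSV named Asteroid.csv (the file name and the exact order of operations are assumptions, not necessarily what the notebook does):

```python
import pandas as pd

df = pd.read_csv("Asteroid.csv", low_memory=False)

# Drop the mostly-empty and unrelated columns.
drop_cols = ["extent", "GM", "IR", "BV", "UB", "G",
             "spec_B", "spec_T", "neo", "pha", "class", "condition_code"]
df = df.drop(columns=drop_cols, errors="ignore")

# Keep only rows where diameter and rot_per are present.
df = df.dropna(subset=["diameter", "rot_per"])

# Coerce everything to numeric (non-numeric diameter entries become NaN and
# are dropped), fill remaining gaps with column medians, round, cast to float.
df = df.apply(pd.to_numeric, errors="coerce").dropna(subset=["diameter"])
df = df.fillna(df.median()).round(5).astype(float)
```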
I fitted some models to evaluate my work so far. The results were terrible, as expected. Later, while tinkering with some seaborn plots, I realized that I had completely forgotten to remove the outliers. What I did next helped a great deal.
I researched outlier detection and found some bad methods on the internet, but then stumbled upon something called a pairplot. After further research and playing with code, I learned that you can build widgets you can interact with inside the Jupyter notebook itself. A few rounds of coding and updating got me to what you can access right now. Detecting and removing outliers has never been easier: we can plot variables against each other, spot the outliers, and remove them with a single line of code (e.g. `df = df[df.a < 20]`).
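A minimal sketch of that kind of widget, using a toy dataframe in place of the cleaned asteroid data (the column names and values here are purely illustrative):

```python
import pandas as pd
import seaborn as sns
from ipywidgets import interact

# Toy frame standing in for the cleaned asteroid data.
df = pd.DataFrame({"albedo": [0.10, 0.22, 0.15, 0.90],
                   "H": [12.0, 14.5, 13.1, 3.0],
                   "diameter": [5.2, 2.1, 3.8, 250.0]})

# Dropdowns let you pick any two columns and scatter-plot them to spot outliers.
@interact(x=list(df.columns), y=list(df.columns))
def scatter(x, y):
    sns.scatterplot(data=df, x=x, y=y)

# Once an outlier band is visible, removing it is a single line, e.g.:
# df = df[df["diameter"] < 200]
```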
Removing the outliers gave me a significant boost in model performance. My previously terrible results were now less terrible: with linear regression I had been getting a mean squared error of about 300, which dropped to 217. Stratified sampling later brought it down to 175.
Learn about Stratified Sampling
An extra column called diameter_grp was created. This is a class column containing values from 1 to 5, indicating how large the asteroid's diameter is. The process is called stratified sampling and is done so that each group (1-5) keeps the same proportion in both the original dataframe and the split dataframe. The graph shows the number of values in each group:
Next, I performed stratified sampling using StratifiedShuffleSplit with respect to 'diameter_grp'. Let's check the proportions of each group in the original and split datasets. They look almost equal, which has a positive effect on our models.
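A minimal sketch of this step, assuming a cleaned frame with a numeric diameter column; the bin edges used to build diameter_grp below are illustrative, not necessarily the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Stand-in data; in the notebook this is the cleaned asteroid frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"diameter": rng.lognormal(1.5, 1.0, 2000)})

# Bucket the diameter into 5 classes (bin edges here are illustrative).
df["diameter_grp"] = pd.cut(df["diameter"],
                            bins=[0, 2, 5, 10, 20, np.inf],
                            labels=[1, 2, 3, 4, 5])

# Split so that each diameter_grp keeps (almost) the same share in both parts.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(df, df["diameter_grp"]):
    train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# Compare the class proportions of the original frame and the training split.
print(df["diameter_grp"].value_counts(normalize=True).sort_index())
print(train_set["diameter_grp"].value_counts(normalize=True).sort_index())
```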
(Plots: diameter_grp proportions in the original set vs. the training set.)
Previous Best:
Best Model: Random Forest.
MSE: 14.0625
RMSE: 3.7500
MAE: 1.2805
Now Best (After sampling):
Best Model: XGBoost.
MSE: 9.8679
RMSE: 3.1413
MAE: 1.2129
Later I found a technique called pruning for decision trees, which reduces overfitting. I tried cost complexity pruning on my decision tree, but it once again produced nothing significant. Nevertheless, it's a great way to reduce overfitting and might work wonders on another dataset.
Know more:
https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
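For reference, here is a minimal sketch of cost complexity pruning in scikit-learn, on synthetic data rather than the asteroid set: the pruning path gives candidate ccp_alpha values, and each one is evaluated on a held-out split.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, not the asteroid set.
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path yields the effective alphas; refit one tree per alpha and
# keep the value that scores best on the held-out split (larger alpha = more pruning).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
scores = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from floating point
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    scores.append((alpha, tree.score(X_test, y_test)))

best_alpha, best_r2 = max(scores, key=lambda t: t[1])
print(f"best ccp_alpha={best_alpha:.4f}, held-out R^2={best_r2:.3f}")
```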
XGBoost outperformed Random Forest and currently gives the lowest error. None of these tree-based algorithms require the data to be scaled.
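As a rough illustration of how the comparison works, here is a minimal sketch that fits both ensembles on synthetic data and reports the same MSE/RMSE/MAE metrics; the data and model settings are placeholders, not the ones from the notebook.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the asteroid features.
X, y = make_regression(n_samples=1000, n_features=8, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree ensembles are compared on unscaled features, using MSE / RMSE / MAE.
models = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "XGBoost": xgb.XGBRegressor(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    mae = mean_absolute_error(y_test, pred)
    print(f"{name}: MSE={mse:.2f} RMSE={np.sqrt(mse):.2f} MAE={mae:.2f}")
```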
The final-day results were as follows:
- Table:
Model | MSE | RMSE | MAE |
---|---|---|---|
Linear Regression | 175.66 | 13.25 | 8.43 |
Lasso | 194.45 | 13.94 | 8.79 |
Ridge | 175.25 | 13.23 | 8.41 |
SVR | 209.44 | 14.47 | 6.34 |
KNN | 190.45 | 13.80 | 7.48 |
Decision Tree | 30.65 | 5.53 | 2.99 |
Random Forest | 10.40 | 3.22 | 1.17 |
Random Forest (Tuned) | 17.28 | 4.15 | 2.29 |
XGBoost | 9.86 | 3.14 | 1.21 |
- Chart: Day 3 vs. Final Day comparison
Best Model: XGBoost.
MSE: 9.8679
RMSE: 3.1413
MAE: 1.2129
- Seaborn
- xgboost
- matplotlib
- sklearn
- ipywidgets
- numpy
- pandas