Hello Everyone,
Here is my classification project, which predicts whether a passenger was transported to an alternate dimension.
I used the Spaceship Titanic dataset, which Kaggle hosts under the Competitions section of its website.
Link to the Dataset: Spaceship Titanic Dataset
- To predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly.
- To make these predictions, we have a set of personal records recovered from the ship's damaged computer system.
- For my project, I have created a Streamlit web app for finding out whether a passenger was transported to an alternate dimension.
- The web app is multi-page, meaning you can navigate between pages through a dropdown menu in the sidebar.
- The first page is the Home page, which contains the problem statement and information about the dataset.
- The second page is the Web Application page, which contains the web application itself, used to classify passengers.
- It also contains a Contribution section in the sidebar, which lets you contribute to the project by starring the repository, forking it or downloading a ZIP file of the entire project.
Link to the Web App: Spaceship Titanic App
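For a rough idea of how such sidebar navigation works in Streamlit, here is a minimal sketch; the page names and functions are illustrative, not the app's actual code:

```python
import streamlit as st

def home_page():
    # Home: problem statement and dataset information
    st.title("Spaceship Titanic")
    st.write("Predict whether a passenger was transported to an alternate dimension.")

def app_page():
    # Web Application: the classifier interface
    st.title("Classify a Passenger")

# Dropdown menu in the sidebar for navigating between pages
page = st.sidebar.selectbox("Navigate", ["Home", "Web Application"])
if page == "Home":
    home_page()
else:
    app_page()
```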
- Setting up the Environment
- Libraries required for the Project
- Getting started with Repository
- Steps involved in the Project
- Conclusion
Jupyter Notebook is required for this project; you can install and set it up from the terminal.
- Install the Notebook
pip install notebook
- Run the Notebook
jupyter notebook
Pandas
- Go to the terminal and run this command
pip install pandas
- Go to Jupyter Notebook and run this command in a cell
!pip install pandas
Matplotlib
- Go to the terminal and run this command
pip install matplotlib
- Go to Jupyter Notebook and run this command in a cell
!pip install matplotlib
Seaborn
- Go to the terminal and run this command
pip install seaborn
- Go to Jupyter Notebook and run this command in a cell
!pip install seaborn
Scikit-learn
- Go to the terminal and run this command
pip install scikit-learn
- Go to Jupyter Notebook and run this command in a cell
!pip install scikit-learn
- Clone this repository to your local machine using the following command:
git clone https://github.com/TheMrityunjayPathak/SpaceshipTitanicClassification.git
Data Cleaning
- First of all, I dropped some unnecessary columns from the dataset, i.e. Name, PassengerId and Cabin.
- Then I found NaN values in the Age column, which I filled with the median age using the fillna() method.
- Then I found the most frequent value in each of the VIP, Destination, HomePlanet and CryoSleep columns and used it to fill the NaN values present in those columns.
- After that, I converted the Transported column to the int data type.
- Then I dropped all the rows with NaN values in the RoomService, FoodCourt, ShoppingMall, Spa and VRDeck columns.
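A condensed sketch of these cleaning steps in pandas (the DataFrame name df and the file name train.csv are assumptions; the column names follow the dataset):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name

# Drop columns that are not useful for prediction
df = df.drop(columns=["Name", "PassengerId", "Cabin"])

# Fill missing ages with the median age
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fill missing categorical values with the most frequent value (mode)
for col in ["VIP", "Destination", "HomePlanet", "CryoSleep"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Convert the boolean target to integers (0/1)
df["Transported"] = df["Transported"].astype(int)

# Drop rows with missing values in the spending columns
df = df.dropna(subset=["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"])
```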
Data Visualization
- I used the sns.countplot() method to visualize all the categorical variables in the dataset:
- Number of passengers transported vs. not transported
- Number of passengers with their respective HomePlanet
- Number of passengers opted for VIP service
- Number of passengers with their respective destinations
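As an example, one of these plots can be sketched like this (splitting the bars by Transported via hue is an assumption):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Passenger counts per home planet, split by transported status
sns.countplot(data=df, x="HomePlanet", hue="Transported")
plt.title("Passengers per HomePlanet")
plt.show()
```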
Dummy Variables
- I created dummy variables for the HomePlanet and Destination columns, stored them in their own DataFrames and then concatenated them with the original DataFrame.
- Then I dropped the HomePlanet and Destination columns, as they are of no further use.
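In pandas, this step can look roughly like the following sketch (the prefix arguments are an assumption):

```python
import pandas as pd

# One-hot encode the two categorical columns into separate DataFrames
home = pd.get_dummies(df["HomePlanet"], prefix="HomePlanet")
dest = pd.get_dummies(df["Destination"], prefix="Destination")

# Concatenate the dummies with the original DataFrame, then drop the originals
df = pd.concat([df, home, dest], axis=1)
df = df.drop(columns=["HomePlanet", "Destination"])
```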
Data Standardization
- I used StandardScaler to bring the numerical features onto a common scale (zero mean and unit variance).
- RoomService, FoodCourt, ShoppingMall, Spa and VRDeck are the columns that were standardized; I stored the scaled values in a DataFrame.
- Then I dropped the unscaled columns and concatenated the scaled DataFrame with the original DataFrame.
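A minimal sketch of the scaling step:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Standardize the spending columns to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df[spend_cols]),
                      columns=spend_cols, index=df.index)

# Drop the unscaled columns and concatenate the scaled DataFrame
df = pd.concat([df.drop(columns=spend_cols), scaled], axis=1)
```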
Imbalanced Data
- After that, I found that the VIP column is highly imbalanced, which can reduce the model's accuracy.
- So I divided the DataFrame into two parts based on whether or not people opted for the VIP service.
- Then I upsampled the VIP group to match the non-VIP group, since the number of VIP passengers is much smaller than the number of non-VIP passengers.
- Then I used sns.countplot() to verify that both classes are evenly balanced.
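One common way to do this kind of upsampling is scikit-learn's resample utility; whether the project used exactly this helper is an assumption, but the idea looks like:

```python
import pandas as pd
from sklearn.utils import resample

# Split the DataFrame into VIP and non-VIP passengers
vip = df[df["VIP"] == 1]
non_vip = df[df["VIP"] == 0]

# Upsample the minority (VIP) group to the size of the majority group
vip_upsampled = resample(vip, replace=True, n_samples=len(non_vip),
                         random_state=42)  # random_state is an assumption

# Recombine into one balanced DataFrame
df = pd.concat([non_vip, vip_upsampled])
```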
Model Creation
- First, I defined the dependent and independent variables for training and testing.
- Then I split the data into training and testing sets using train_test_split.
- Then I fitted a support vector machine and a random forest classifier on X_train and y_train and checked their scores.
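A minimal sketch of this step (the test size and random state are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Independent variables (features) and dependent variable (target)
X = df.drop(columns=["Transported"])
y = df["Transported"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit both models and compare their accuracy on the test set
svm = SVC().fit(X_train, y_train)
rf = RandomForestClassifier().fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))
print("Random forest accuracy:", rf.score(X_test, y_test))
```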
- After that, I used k-fold cross-validation to further test the accuracy of the models.
- So I checked the mean cross_val_score of both the SVM and the random forest classifier to find the best score.
- The random forest classifier and the SVM achieved accuracy scores of 88% and 83% respectively.
- The random forest classifier was validated with a mean cross_val_score of 88%, demonstrating better robustness than the SVM, whose mean cross_val_score was 81%.
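The cross-validation comparison can be sketched as follows, reusing X and y from the previous sketch (the fold count is an assumption):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Mean k-fold cross-validation accuracy for each model (cv=5 is assumed)
svm_scores = cross_val_score(SVC(), X, y, cv=5)
rf_scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print("SVM mean CV accuracy:", svm_scores.mean())
print("Random forest mean CV accuracy:", rf_scores.mean())
```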