added dropna to avoid crash on nan values #275
Closed
AlexanderZender wants to merge 9 commits into oegedijk:master from AlexanderZender:fix-for-nan-in-categorical-types-and-value-manipulation-prevention-by-models
Changes from 7 commits
9 commits
2541e27 added dropna to avoid crash on nan values (AlexanderZender)
723b4b1 added nan value to categorical features (AlexanderZender)
be6cfc0 added conversion for string NaN from frontend (AlexanderZender)
b318e04 added test for nan categorical (AlexanderZender)
6d052b7 added more acc classes in dataset and dashboard generation in NaN cat… (AlexanderZender)
3a9374f changed used dataset to titanic (AlexanderZender)
f98eae5 removed one hot encoder (AlexanderZender)
1f459a7 removed unecessary copy (AlexanderZender)
5359bc0 Merge branch 'master' into fix-for-nan-in-categorical-types-and-value… (AlexanderZender)
@@ -0,0 +1,75 @@
+from sklearn.ensemble import RandomForestClassifier
+import pandas as pd
+from explainerdashboard import ClassifierExplainer, ExplainerDashboard
+from sklearn.preprocessing import LabelEncoder
+from sklearn.model_selection import train_test_split
+import os
+import numpy as np
+
+class CategoricalModelWrapper:
+    def __init__(self, model, categorical_label_test) -> None:
+        self._model = model
+        self._categorical_label_test = categorical_label_test
+
+    def _perform_label_encoding(self, y):
+        label_enc = LabelEncoder()
+        label_enc.fit(["Survived", "Not Survived"])
+        return pd.Series(label_enc.transform(y.values), name=y.name, index=y.index)
+
+    def _perform_label_decoding(self, y):
+        label_enc = LabelEncoder()
+        label_enc.fit(["Survived", "Not Survived"])
+        # y is either a Series or the ndarray returned by predict(), which has no .name
+        return pd.Series(label_enc.inverse_transform(y), name=getattr(y, "name", None))
+
+    def _preprocessor(self, X):
+        return X.drop(["Name"], axis=1)
+
+    def _postprocessor(self, y):
+        if self._categorical_label_test:
+            y = self._perform_label_decoding(y)
+        return y
+
+    def predict(self, X):
+        X = self._preprocessor(X)
+        y = self._model.predict(X)
+        return self._postprocessor(y)
+
+    def predict_proba(self, X):
+        X = self._preprocessor(X)
+        probabilities_raw = self._model.predict_proba(X)
+        return probabilities_raw
+
+def generate_categorical_dataset_model_wrapper(categorical_label_test=False):
+    model = RandomForestClassifier(n_estimators=5, max_depth=2)
+    wrapper = CategoricalModelWrapper(model, categorical_label_test)
+    # Join the path components so the test also runs outside Windows
+    df = pd.read_csv(os.path.join(os.getcwd(), "tests", "test_assets", "data.csv"))
+    if categorical_label_test:
+        # Test for a categorical label: convert the binary numeric Titanic label to "Survived" / "Not Survived"
+        df["Survival"] = wrapper._perform_label_decoding(df["Survival"])
+    else:
+        # We only test NaN in a categorical feature with a numerical target
+        # Blank out every tenth name (rows 0, 10, ..., 80); .loc avoids chained assignment
+        df.loc[list(range(0, 90, 10)), "Name"] = np.nan
+    X_train, X_test, y_train, y_test = train_test_split(df.drop(["Survival"], axis=1), df["Survival"], test_size=0.2, random_state=42)
+
+    X_train = wrapper._preprocessor(X_train)
+
+    if categorical_label_test:
+        y_train = wrapper._perform_label_encoding(y_train)
+
+    model.fit(X_train, y_train)
+    return wrapper, X_test, y_test
+
+def test_NaN_containing_categorical_dataset():
+    _wrapper, _test_X, _test_y = generate_categorical_dataset_model_wrapper()
+    explainer = ClassifierExplainer(_wrapper, _test_X, _test_y)
+    dashboard = ExplainerDashboard(explainer)
+    assert "NaN" in explainer.categorical_dict["Name"]
+
+def test_categorical_label():
+    _wrapper, _test_X, _test_y = generate_categorical_dataset_model_wrapper(True)
+    explainer = ClassifierExplainer(_wrapper, _test_X, _test_y)
+    dashboard = ExplainerDashboard(explainer)
+    assert "Survived" in explainer.labels
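For readers who want the idea behind the first test without opening the library, here is a minimal sketch of what the commit messages describe: drop missing values before collecting a column's categories, expose them as a single "NaN" entry, and turn that string back into a real missing value when it comes back from the frontend. This is not explainerdashboard's actual implementation. The names `NAN_LABEL`, `categories_with_nan` and `restore_nan` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder; it mirrors the string the test looks for in categorical_dict
NAN_LABEL = "NaN"


def categories_with_nan(column: pd.Series) -> list:
    """Return the sorted non-missing categories, plus NAN_LABEL if any value is missing."""
    # dropna() first: sorting a mix of strings and float NaN would raise a TypeError
    categories = sorted(column.dropna().unique().tolist())
    if column.isna().any():
        categories.append(NAN_LABEL)
    return categories


def restore_nan(value):
    """Map the placeholder string coming back from a UI control to a real missing value."""
    return np.nan if value == NAN_LABEL else value


names = pd.Series(["Allen", np.nan, "Braund", np.nan], name="Name")
print(categories_with_nan(names))  # ['Allen', 'Braund', 'NaN']
print(restore_nan("Braund"))       # Braund
```

Keeping the placeholder in one constant means the list shown in the dashboard and the conversion back to `np.nan` cannot drift apart, which is the property `test_NaN_containing_categorical_dataset` relies on.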
I don't think this is needed as we already have the .copy in line 654