-
-
Notifications
You must be signed in to change notification settings - Fork 333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added dropna to avoid crash on nan values #275
added dropna to avoid crash on nan values #275
Conversation
added copy to any predict or predict_proba call to only pass definitiv copies
During the initial opening sequence of the dashboard, a prediction is made using all nan values: This causes an exception in the console that the encoder of the model can not deal with the passed values. |
I added a 'NaN' category value to the categorical_dict if the categorical feature has NaN as values. |
I'm not sure why we are adding all those .copy() to the prediction calls? prediction shouldn't have any side effects on X, so why is this needed? |
could you add some tests that shows this works? |
The copy is technically not needed but it will avoid errors, for example if you pass a pipeline that manipulate the incoming dataframe within its process. You can let the user manage that, but it may not always be possible to perform a copy command within the pipeline or own code. Edit: as your comment asked about the predict function, I was unsure if this problem also occurred there. |
removed copy from predict function calls, added test for testing categorical labels
@oegedijk I added the requested test, and removed copy from all predict function calls. I'm not completely sure how many changes are needed to allow categorical labels, but the test is already there with a dataset to test with. The dataset is an adjusted subset of the car dataset from OpenML |
I guess all these predict functions are only predicting a single row, so maybe the cost of adding these copy's is not so bad? Alternative would be only doing them for pipelines, but that would also introduce additional overhead. |
I think it should be possible to reuse the titanic test set? Just replace the 0,1 labels with 'survived', 'not survived'. For testing the NaN's, we could just randomly sprinkle some NaNs in. there. What would be the fastest/cheapest model that works well with missing values to minimize test time? (HistGradientBoostingClassifier comes to min, but maybe there are faster options) |
I think the overhead depends on the complexity of the dataset that gets used. |
That would be possible if this does not cause issues with other tests. As for the training time, I just used a RandomForestClassifier which trains in less than a second on my system. |
I updated the test to use the available Titanic data. |
Ah, apologies. I didn't fully understand then, I thought you were thinking of algorithms that are able to deal with missing values (such as e.g. HistGradientBoostingClassifier), but you mean pipelines that fill in NaNs. Both might be something we should be able to support. |
Will have a closer look at the code hopefully tomorrow or this weekend. But thanks already for this contribution, let's try to get it ready and released quickly! |
Sounds good, tell me if you find any issue with it. |
@@ -572,7 +572,7 @@ def one_vs_all_metric(metric, pos_label, y_true, y_pred): | |||
sign = 1 if greater_is_better else -1 | |||
|
|||
def _scorer(clf, X, y): | |||
y_pred = clf.predict_proba(X) | |||
y_pred = clf.predict_proba(X.copy()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed as we already have the .copy in line 654
You are correct, i removed it |
…-manipulation-prevention-by-models
@oegedijk Totally forgot that there was a conflict, ups |
@oegedijk Any info when this would be looked at? |
added copy to any predict or predict_proba call to only pass definitiv copies.
fixed issue when categorical features have nan as values, a crash occurs
added the option to allow nan category values in frontend
issue: #273